
The Frontier of Spatial Intelligence with Fei-Fei Li

2024/9/19

a16z Podcast

People
Fei-Fei Li
Justin Johnson
Martin Casado (General Partner, focused on AI investment and advancing the industry)
Topics
Fei-Fei Li: Spatial intelligence is the next major direction for AI. It is fundamentally different from language models, and its importance is comparable to that of language. She believes we are in a period of explosive growth in AI, that spatial intelligence will be an important direction for AI's future, and that it can be applied to virtual reality, augmented reality, and robotics. She also emphasizes the importance of data-driven models and the impact of compute on AI's development. She recounts her research journey in AI and her long-standing focus on spatial intelligence. She believes the mark of success for spatial intelligence models will be their wide adoption and their ability to solve real-world problems. Justin Johnson: He believes the future of AI lies in understanding new data, especially data from the real world. He recounts his research journey in AI and his long-standing focus on spatial intelligence. He believes growth in compute has been critical to AI's development, and he elaborates on the success of deep learning and the impact of the NeRF model on 3D computer vision. He believes 3D representations are better suited to tasks in the 3D world and enable better interaction with users. Martin Casado: He introduces the important contributions Fei-Fei Li and her team have made to AI, and describes the founding background and direction of World Labs. He believes spatial intelligence models can be applied to virtual reality, augmented reality, and robotics, and can create new forms of media.


Chapters
The discussion covers the evolution of AI from the last AI winter to the current explosion of consumer-grade AI applications, highlighting key watershed moments and the deepening of technology and industry adoption.
  • AI has come out of the last AI winter and seen the birth of modern AI.
  • Deep learning has shown possibilities like playing chess and language models.
  • The current moment is described as a literal Cambrian explosion in AI applications.

Transcript


This is fundamentally, philosophically, a different problem.

The previous decade had mostly been about understanding data that already exists, but the next decade was going to be about understanding new data.

Visual spatial intelligence is so fundamental. It's as fundamental as language.

It's like unwrapping presents on Christmas. Every day, you know there's going to be some amazing new discovery, some amazing new application or algorithm somewhere.

If we see something or if we imagine something, both can converge towards generating it. I think we're in the middle of a Cambrian explosion.

To many, the last two years of AI have felt like a light switch: pre and post GPT-3, pre and post being able to generate images with natural language, and even pre and post translating any video with the click of a button.

But to some, like Dr. Fei-Fei Li, often referred to as, quote, the godmother of AI, and longtime professor of computer science at Stanford who, by the way, taught some very well-known researchers like Andrej Karpathy, to people like her, technologies like artificial intelligence have existed on a multi-decade-long continuum. And that continuum is destined to proceed into the physical, spatial world. At least that's what Fei-Fei and her co-founders of the new company World Labs believe.

And these four founders pioneered the ecosystem in so many ways, from Fei-Fei's ImageNet to Justin Johnson's work on scene graphs, Ben Mildenhall's work on NeRF, or even Christoph Lassner's work on the precursor to Gaussian splatting. And in today's episode, you'll get to hear from Fei-Fei and Justin as they explore this evolution with a16z general partner Martin Casado, from the very early seeds to the recent explosion of consumer-grade AI applications and the key watershed moments along the way. We'll, of course, dive into the why now behind World Labs, but also their choice to focus on spatial intelligence and what it might take to build at that frontier, from algorithmic unlocks to hardware. All right, let's get started.

As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures.

Over the last two years, we've seen this kind of massive rush of consumer AI companies and technology, and it has been quite wild. But you've been doing this now for decades. And so maybe walk us a little bit through how we got here, kind of like your key contributions and insights along the way.

So it is a very exciting moment, right? Just zooming back, AI is in a very exciting moment. I personally have been doing this for two decades plus, and we have come out of the last AI winter.

We have seen the birth of modern AI. Then we have seen deep learning taking off, showing us possibilities like playing chess. But then we're starting to see the deepening of the technology and the industry adoption of some of the earlier possibilities, like language models.

And now I think we're in the middle of a Cambrian explosion, in almost a literal sense, because now, in addition to text, you see pixels, videos, and audio all coming out with possible AI applications and models. So it's a very exciting moment. I know you both

so well, and many people know you both well because you're so prominent in the field, but not everybody. And so maybe it's worth just going through your backgrounds, just to kind of level-set the audience.

Yeah, sure. So I first got into AI at the end of my undergrad. I did math and computer science in undergrad at Caltech, which was awesome.

But then towards the end of that, there was this paper that came out that was, at the time, a very famous paper: the cat paper from Quoc Le, Andrew Ng, and others who were at Google Brain at the time. And that was the first time that I came across this concept of deep learning. And to me, it just felt like this amazing technology.

And that was the first time that I came across this recipe that would come to define the next decade-plus of my life, which is that you can take these amazingly powerful learning algorithms that are very generic, couple them with very large amounts of compute, couple them with very large amounts of data.

And magic things start to happen when you combine those ingredients. So I first came across that idea around 2011, 2012-ish, and I just thought, oh my god, this is going to be what I want to do. It was obvious I had to go to grad school to do this stuff, and then I saw that Fei-Fei was at Stanford and was one of the few people in the world at the time who was on that train.

And that was just an amazing time to be in deep learning and computer vision specifically, because that was really the era when this went from the first nascent bits of technology that were just starting to work, and really got developed and spread across a ton of different applications. So over that time, we saw the beginnings of language modeling. We saw the beginnings of discriminative computer vision, where you could take pictures and understand what's in them in a lot of different ways.

We also saw some of the early bits of what we would now call GenAI: generative modeling, generating images, generating text. A lot of those algorithmic pieces actually got figured out by the academic community during my PhD years. There was a time I would just wake up every morning and check the new papers on arXiv and just be ready. It was like unwrapping presents on Christmas every day.

You know there's going to be some amazing new discovery, some amazing new application or algorithm somewhere in the world. In the last two years, everyone else in the world kind of came to the same realization, using AI to get a new Christmas present every day. But I think for those of us who have been in the field for a decade or more, we've sort of had that experience for a very long time.

I came to AI through a different angle, which is physics, because my undergraduate background was in physics. Physics is the kind of discipline that teaches you to ask audacious questions and to think about what are the still-remaining mysteries of the world. Of course, in physics that's the atomic world, the universe, and all that.

But somehow that kind of training of thinking got me into the audacious question that really captured my own imagination, which is intelligence. So I did my PhD in AI and computational neuroscience at Caltech. So Justin and I actually didn't overlap, but we share the same alma mater at Caltech.

The same advisor.

Yes, same advisor: your undergrad advisor, my PhD advisor, Pietro Perona. And my PhD

time, which was similar to yours, was when AI was still in the winter in the public eye. But it was not in the winter in my eye, because it's that pre-spring hibernation: there's so much life. Machine learning, statistical modeling was really gaining power.

I think I was one of the native generation of machine learning and AI, whereas I look at Justin's generation as the native deep learning generation. So machine learning was the precursor of deep learning.

And we were experimenting with all kinds of models. But one thing came out at the end of my PhD

and the beginning of my assistant professor time: there was an overlooked element of AI that is mathematically important for driving generalization, but the whole field was not thinking that way. And that was data, because we were thinking about the intricacy of Bayesian models or kernel methods and all that.

But what was fundamental, which my students and my lab realized probably earlier than most people, is that if you let data drive models, you can unleash a kind of power that we hadn't seen before. And that was really the reason we went on a pretty crazy bet on ImageNet. You know, forget about any scale we're seeing now; datasets at that point were thousands of data points. The NLP community had their own datasets.

I remember the UCI wine dataset, or some such dataset in NLP; it was small. The computer vision community had datasets, but all on the order of thousands or tens of thousands. We were like, we need to drive to internet scale. And luckily, it was also the coming of age of the internet, so we were riding that wave. And that's when I came to Stanford.

So these epochs are what we often talk about. ImageNet is clearly the epoch that created, or made popular and viable, computer vision. In the GenAI wave, we talk about two kinds of core unlocks: one is the transformer paper, which is attention, and then stable diffusion.

Is that a fair way to think about this, that there were two algorithmic unlocks that came from academia and Google, and that's where everything comes from? Or has it been more deliberate? Or have there been other big unlocks that brought us here that we don't talk about as much?

I think the big unlock is compute.

I know the story of AI is often the story of compute, but no matter how much people talk about it, I think people still underestimate it, right? The amount of growth that we've seen in computational power over the last decade is astounding. The first paper that's really credited with the breakthrough moment in computer vision for deep learning was AlexNet, a 2012 paper where a deep neural network did really well in the ImageNet challenge and just blew away all the other algorithms people had been working on, the types of algorithms people had been working on more in grad school. AlexNet was a 60-million-parameter deep neural network.

And it was trained for six days on two GTX 580s, which was the top consumer card at the time; it came out in 2010. So I was looking at some numbers last night, just to put this in perspective. The newest, latest and greatest from NVIDIA is the GB200.

Do either of you want to guess how much of a raw compute factor we have between the GTX 580 and the GB200? No? Come on, go for it. It's in the thousands. So I ran the numbers last night: that training run of six days on two GTX 580s, if you scale it, comes out to just under five minutes on a single GB200.
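As a rough sanity check on that claim, here is the back-of-the-envelope arithmetic (a minimal sketch; the ~3,500x speedup is an assumed round number standing in for "in the thousands," not a measured GTX 580-to-GB200 benchmark):

```python
# Back-of-the-envelope check of the scaling claim above. The speedup factor
# is an assumption ("in the thousands"), not a vendor benchmark.
gpu_minutes = 6 * 2 * 24 * 60      # AlexNet: ~6 days on 2 GPUs, in GPU-minutes
assumed_speedup = 3500             # hypothetical per-GPU raw-compute factor
print(f"~{gpu_minutes / assumed_speedup:.1f} minutes")  # -> ~4.9 minutes
```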

Justin is making a really good point. The 2012 AlexNet paper on the ImageNet challenge used literally a very classic model: the convolutional neural network.

And that was published in the 1980s. It's one of the first papers I remember learning as a graduate student, and moreover, it also had six, seven layers. Practically the only difference between AlexNet and that ConvNet is the two GPUs and the deluge of data. Yeah.

So I think most people are now familiar with, quote, the bitter lesson. And the bitter lesson says, if you make an algorithm, don't be cute; just make sure you can take advantage of available compute, because the compute will show up. On the other hand, there's another narrative, which seems to me to be just as credible, which is that it's actually new data sources.

And that's a great example: self-attention from transformers is great, but the way you can explain it is human labeling of data, because it's the humans that put the structure in the sentences.

And if you look at CLIP, well, it's using the internet to, like, actually have humans use alt tags to label images, right? And so that's a story of data; that's not a story of compute. And so is the answer just both, or...

Yeah, that's another really good point. So I think there are actually two epochs that, to me, feel quite distinct algorithmically here. The ImageNet era is actually the era of supervised learning.

In the era of supervised learning, you have a lot of data, but you don't know how to use the data on its own. The expectation of ImageNet, and other datasets of that time period, was that we're going to get a lot of images, but we need people to label every one. For all of the training data that we're going to train on, a human labeler has looked at every one and said something about that image. And the big algorithmic unlock was that we now know how to train on things that don't require human-labeled data.

As the only person in the room that doesn't have an AI background, it seems to me that if you're training on human data, the humans have labeled it,

it's just not explicit. I knew you were going to say that. I knew it. Yes, philosophically, that's a really important question. But that actually is more true of language than of pixels, to be fair.

Yes, yeah, yeah. But I do think it's an important distinction, because CLIP really is human labels, yeah. I think attention is, humans have figured out the relationships of things, and then you learn them. So that is human-labeled, just more implicit than explicit.

Yeah, it's still human-labeled. The distinction is that in the supervised learning era, our learning tasks were much more constrained. You would have to come up with this ontology of concepts that we wanted to discover, right? If you're doing ImageNet, Fei-Fei and her students at the time spent a lot of time thinking about which thousand categories should be in the ImageNet challenge. Other datasets of that time, like the COCO dataset for object detection, they thought really hard about which eighty categories to put in there.

So this walks us into GenAI. When I was doing my PhD, before all this came along, I took machine learning from Andrew Ng, and I took Bayesian, something very complicated, from Daphne Koller, and it was very complicated for me. A lot of that was predictive modeling.

And then I remember the whole kind of vision stuff that you unlocked. But then this stuff showed up in the last four years, which is, to me, very different: not identifying objects, not predicting something; you're generating something. So maybe walk through the key unlocks that got us there, and why it's different, and whether we should think about it differently, or is it part of a continuum?

It is so interesting. Even during my graduate time, generative models were there. We wanted to do generation.

Nobody remembers that. Even with letters and numbers, we were trying to do some generation. Geoff Hinton had generative papers. We were thinking about how to generate.

And in fact, if you think from a probability distribution point of view, you can mathematically generate; it's just that nothing we generated would ever impress anybody, right? So this concept of generation was there mathematically, theoretically, but nothing worked. Look at Justin's PhD:

his entire PhD is a story, almost a mini-story of the trajectory of the field. He started his first project in data. I forced him to. He didn't like it.

So in retrospect, I learned a lot of really

useful things. I'm glad you say that now.

So actually my first paper, both of my PhD and, like, ever, my first academic publication ever, was image retrieval

with scene graphs. Then we were wanting to take pixels and generate words, and Justin and Andrej really worked on that, but that was still a very, very lossy way of getting information out of the pixel world. And then in the middle, Justin went off and did a very famous piece of work.

And it was the first time that someone made it real time, right? Yeah, yeah. So the story there is,

there was a paper that came out in 2015, "A Neural Algorithm of Artistic Style," led by Leon Gatys. The paper came out, and they showed these real-world photographs that they had converted into Van Gogh style. We are kind of used to seeing things like this in 2024, but this was 2015.

So this paper just popped up on arXiv one day, and it blew my mind. I just got this GenAI brainworm in my brain, in 2015, and it did something to me.

And I thought, oh my god, I need to understand this algorithm, I need to play with it, I need to make my own images into Van Gogh. So then I read the paper,

and then over a long weekend I reimplemented the thing and got it to work. It was actually a very simple algorithm; my implementation was like three hundred lines of Lua, because at the time it was Lua.

This was pre-PyTorch, so we were using Lua Torch. It was a very simple algorithm, but it was slow, right? It was an optimization-based thing: for every image you want to generate, you need to run this optimization loop, this gradient descent loop, for every image you generate. The images were beautiful, but I just wanted it to be faster.
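(Editor's note: for readers curious what that loop looks like, here is a minimal sketch of the optimization-based style transfer being described, written in modern PyTorch rather than the original Lua Torch. The layer indices, loss weights, and step counts are illustrative assumptions, not Justin's actual implementation.)

```python
# Minimal sketch of Gatys-style optimization-based style transfer:
# gradient descent on the pixels of a single image.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
cnn = vgg19(weights=VGG19_Weights.DEFAULT).features.to(device).eval()
for p in cnn.parameters():
    p.requires_grad_(False)

CONTENT_LAYERS = {21}              # conv4_2 (assumed layer indexing)
STYLE_LAYERS = {0, 5, 10, 19, 28}  # conv1_1 .. conv5_1

def extract(x):
    """Run the CNN, collecting activations at the chosen layers."""
    content, style = {}, {}
    for i, layer in enumerate(cnn):
        x = layer(x)
        if i in CONTENT_LAYERS:
            content[i] = x
        if i in STYLE_LAYERS:
            style[i] = x
    return content, style

def gram(f):
    """Gram matrix: channel-wise feature correlations, the 'style' statistic."""
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_transfer(content_img, style_img, steps=300, style_weight=1e6):
    c_targets = {i: f.detach() for i, f in extract(content_img)[0].items()}
    s_targets = {i: gram(f).detach() for i, f in extract(style_img)[1].items()}
    img = content_img.clone().requires_grad_(True)   # optimize the pixels
    opt = torch.optim.Adam([img], lr=0.02)
    for _ in range(steps):  # this per-image loop is exactly why it was slow
        opt.zero_grad()
        c, s = extract(img)
        loss = sum(F.mse_loss(c[i], c_targets[i]) for i in CONTENT_LAYERS)
        loss = loss + style_weight * sum(
            F.mse_loss(gram(s[i]), s_targets[i]) for i in STYLE_LAYERS)
        loss.backward()
        opt.step()
    return img.detach()
```

The speedup discussed next came from training a feedforward network to approximate this loop in a single forward pass, so generation no longer requires per-image optimization.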

And Justin just did it. And that was actually, I think, your first taste of an AI work having industry impact.

A bunch of people had seen this artistic style transfer at the time. Me and a couple of others came up, at the same time, with different ways to speed it up, but mine was the one that got a lot of traction.

Before the world understood GenAI, Justin's last piece of work in his PhD was actually inputting language and getting a whole picture out. It was one of the first GenAI works, using GANs, which were so hard to use. The problem is that we were not ready to use a natural piece of language.

So, as you heard, Justin worked on scene graphs: we had to input a scene-graph language structure, the sheep, the grass, the sky, in a graph way. And it literally was one of our photos, right? And then he and another very good master's student, Agrim, got that GAN to work.

So you can see: from data, to matching, to style transfer, to generating images. You asked if this is an abrupt change: for people like us, it's been happening in a continuum. But for the world, the results feel more abrupt.

So I read your book, and for those who are listening, phenomenal book; I really recommend reading it. And it seems for a long time, a lot of your research and your direction has been towards kind of spatial stuff and pixel stuff and intelligence. And now you're doing World Labs, and it's around spatial intelligence.

So maybe talk through: has this been part of a long journey for you? Why did you decide to do it now? Is it technical? Is it personal? Is it market? Move us from that milieu of AI research

to World Labs. For me, it is both personal and intellectual, right? My entire intellectual journey is really this passion to seek north stars, but also believing that those north stars are critically important for the advancement of our field.

At the beginning, I remember after graduate school, I thought my north star was telling stories of images, because for me, that's such an important piece of visual intelligence, which is part of what you call AI or AGI. But when Justin and I did that, I was like, oh my god, that was my life's dream. What do I do next? It came along much faster.

I thought it would take a hundred years to do that. But visual intelligence is my passion, because I do believe for every intelligent being, like people or robots or some other form, knowing how to see the world, reason about it, interact in it, whether you're navigating or manipulating or making things, you can even build civilization upon it: visual, spatial intelligence is so fundamental. It is as fundamental as language, possibly more ancient and more fundamental in certain ways.

So it's very natural for me that our north star is to unlock spatial intelligence. And the moment, to me, is right: we've got these ingredients. We've got compute. We've got a much deeper understanding of data, way deeper than in the ImageNet days; compared to those days, we are so much more sophisticated. And we've got advances in algorithms, including from co-founders in our company, like Ben Mildenhall and Christoph Lassner, who were at the cutting edge of NeRF. We are at the right moment to really make a bet, to focus, and to just unlock that.

So I just want to clarify for folks listening to this: you're starting this company, World Labs, and spatial intelligence is how you've generally described the problem you're solving. Can you maybe try to crisply describe what that means?

Yes. So spatial intelligence is about machines' ability to perceive, reason, and act in 3D space and time: to understand how objects and events are positioned in 3D space and time, how interactions in the world can affect those 4D space-time positions, and to perceive, reason about, generate, and interact with all of that. It's really about taking the machine out of the mainframe, or out of the data center, putting it out into the world, and understanding the 3D and 4D world with all of its richness.

So to be very clear, are we talking about the physical world, or are we just talking about a virtual notion of world?

I think it can be both. I think it can be both, and that encompasses our long-term vision. Even if you're generating worlds, even if you're generating content, positioning it in 3D has a lot of benefits. Or, if you're recognizing the real world, being able to put 3D understanding onto the real world is also part of it.

Just for people listening: the two other co-founders, Ben Mildenhall and Christoph Lassner, are absolute legends in the field at the same level. These four decided to come together to do this company now. And so I'm trying to dig into why now is the right time. Yeah.

I mean, this is again part of a longer evolution for me. But post-PhD, when I was really wanting to develop into my own independent researcher for my later career, I was just thinking: what are the big problems in AI and computer vision? And the conclusion that I came to around that time was that the previous decade had mostly been about understanding data that already exists, but the next decade was going to be about understanding new data.

And if we think about it, the data that already existed was all of the images and videos that already existed on the web. And the next decade was going to be about understanding new data, right? People have smartphones.

Smartphones have cameras. Those cameras have new sensors. Those cameras are positioned in the 3D world. It's not just that you're going to get a bag of pixels from the internet, know nothing about it, and try to say whether it's a cat or a dog. We want to treat these images as universal sensors into the physical world.

And how can we use that to understand the 3D and 4D structure of the world, either in physical spaces or in generated spaces? So I made a pretty big pivot post-PhD into 3D computer vision: predicting 3D shapes of objects with some of my colleagues at FAIR at the time.

Then later, I got really enamored with the idea of learning 3D structure through 2D, right? Because we talk about data a lot, and 3D data is hard to get on its own. But there's a very strong mathematical connection here: our 2D images are projections of a 3D world, and there's a lot of mathematical structure we can take advantage of.
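The "mathematical structure" in question is, at its core, classical multi-view geometry. In standard textbook notation (nothing specific to World Labs), a 3D point projects to a pixel via the pinhole camera model:

```latex
% Pinhole projection: 3D point (X, Y, Z) maps to pixel (u, v) through the
% camera intrinsics K and pose [R | t]; lambda is the depth along the ray.
\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
  = K \,[\, R \mid t \,]
    \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
```

Each pixel constrains its 3D point only up to the unknown depth along a ray, which is why a single image is ambiguous; combining many views of the same scene pins the geometry down, and that is the structure reconstruction methods, and later NeRF, exploit.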

So even if you have a lot of 2D data, there have been a lot of people doing amazing work to figure out how you can back out the 3D structure of the world from large quantities of 2D observations. And then, in 2020, you asked about big breakthrough moments:

there was a really big breakthrough moment from our co-founder Ben Mildenhall at the time, with his paper NeRF, Neural Radiance Fields. It was a very simple, very clear way of backing out 3D structure from 2D observations, and it just lit a fire under this whole space of 3D computer vision. I think there's another aspect here that maybe people outside the field don't quite understand.

That was also a time when large language models were starting to take off. A lot of the stuff with language modeling had actually gotten developed in academia. Even during my PhD,

I did some work with language. You remember RNNs, GRUs; this was pre-transformer. But then at some point, around the GPT-2 time, you couldn't really do those kinds of models in academia anymore, because they took way more resources. But there was one really interesting thing about the NeRF approach that Ben came up with: you could train these in a couple of hours on a single GPU.

So I think at that time, there was a dynamic that happened, which is that a lot of academic researchers ended up focusing on these problems, because there was core algorithmic stuff to figure out, and because you could actually do a lot without a ton of compute; you could get state-of-the-art results on a single GPU. Because of those dynamics, a lot of researchers in academia were moving to think about what the core algorithmic ways are that we can advance this area. Then I ended up chatting with Fei-Fei more, and I realized that we were actually... well, she's very convincing.

Well, there's that. But you were talking about trying to figure out your own independent research trajectory apart from your advisor, and it turns out to converge; it's very similar. OK, for my end, I wanted to

talk to the smartest person, so I called Justin. There's no question about it. I do want to talk about a very interesting technical story about pixels that most people working in language don't realize, from the pre-GenAI era in the field of computer vision.

Those of us who work on pixels actually have a long history in an area of research called reconstruction, 3D reconstruction. It dates back to the seventies. You can take photos, because humans have two eyes, right? So in general it starts with stereo photos, and then you try to triangulate the geometry and make a 3D shape out of it. It is a really, really hard problem, to this day.

It's not fundamentally solved, because there are correspondence problems and all that. So this whole field, which is an older way of thinking about 3D, has been going on and making really good progress. But with what has happened in the context of generative methods, in the context of diffusion models, suddenly reconstruction and generation start to really merge.

Now, within a really short period of time in the field of computer vision, it's hard to talk about reconstruction versus generation anymore. We suddenly have a moment where, if we see something or if we imagine something, both can converge towards generating it. That's, to me, a really important moment for computer vision, but most people are missing it, or not talking about it as much as LLMs.

When pixel spaces are reconstructed, you reconstruct, like, a scene that's real. And if you don't see the scene, you use generation techniques, right? So these things are very similar. Throughout this entire conversation, you've been talking about language and you've been talking about pixels. So maybe it's a good time to talk about how spatial intelligence, and what you're working on, contrasts with language approaches, which of course are very popular now. Is it complementary? Is it orthogonal?

I think they are complementary.

I don't mean to be too leading here; maybe just contrast them. Everybody says: I know OpenAI, and I know GPT, and I know multimodal models. And a lot of what you're talking about is, well, they've got pixels and they've got language, so doesn't this kind of do what we want to do with spatial reasoning?

Yeah. So I think to do that, you need to open up the black box a little bit and look at how these systems work under the hood. With language models, and the multimodal language models that we're seeing nowadays, the underlying representation under the hood is a one-dimensional representation.

We talk about context lengths, we talk about transformers, we talk about sequences, attention. Fundamentally, their representation of the world is one-dimensional: these things fundamentally operate on a one-dimensional sequence of tokens.

That's a very natural representation when you're talking about language, because written text is a one-dimensional sequence of discrete letters. That underlying representation is the thing that led to LLMs. And with the multimodal LLMs that we're seeing now, you kind of end up shoehorning the other modalities into this underlying representation of a one-dimensional sequence of tokens.
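To make that "shoehorning" concrete, here is a minimal sketch of the common ViT-style patchify step (the sizes are illustrative, not taken from any particular model): a 2D image is flattened into a 1D sequence of patch tokens so that a transformer, which only consumes 1D sequences, can ingest it.

```python
# Minimal sketch: flattening a 2D image into a 1D token sequence (ViT-style).
# All sizes are illustrative, not from any specific model.
import torch

image = torch.randn(1, 3, 224, 224)              # (batch, channels, H, W)
P = 16                                           # patch size
patches = image.unfold(2, P, P).unfold(3, P, P)  # (1, 3, 14, 14, 16, 16)
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * P * P)
print(tokens.shape)  # torch.Size([1, 196, 768]): a 1D sequence of 196 tokens
```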

Now, when we move to spatial intelligence, it's kind of going the other way: we're saying that the three-dimensional nature of the world should be front and center in the representation. From an algorithmic perspective, that opens up the door for us to process data in different ways, to get different kinds of outputs out of it, and to tackle slightly different problems. So even at a coarse level, you can look at it from the outside and say, oh, multimodal LLMs

can look at images too. Well, they can. But I think they don't have that fundamental 3D representation at the heart of their approaches.

I totally agree with Justin. I think the one-dimensional versus fundamentally three-dimensional representation is one of the most core differences. The other thing, and this is slightly philosophical but really important, for me at least: language is fundamentally a purely generated signal.

There's no language out there; you don't go out into nature and find words written in the sky for you. Whatever data you feed in, you pretty much can just somehow regurgitate, with enough generalizability, the same data out. And that's language to language. But the 3D world is not like that.

There is a 3D world out there that follows the laws of physics, that has its own structures due to materials and many other things. To fundamentally back that information out, and be able to represent it and generate it, is just fundamentally quite a different problem. We will be borrowing similar ideas, or useful ideas, from language and LLMs, but this is fundamentally, philosophically, a different problem.

So language is one-dimensional, and probably a bad representation of the physical world, and it's been generated by humans. But there's another modality of generative AI models, which is pixels: 2D images and 2D video. And one could say, if you look at a video, you can see 3D stuff, because you can pan a camera or whatever it is. So how would spatial intelligence be different than, say,

2D video? When I think about this, it's useful to disentangle two things. One is the underlying representation, and two is the user-facing affordances that you have. And here's where you can sometimes get confused, because fundamentally we see in 2D, right? Our retinas are 2D structures in our bodies, and we've got two of them.

So fundamentally, our visual system perceives 2D images. But the thing is, depending on what representation you use, there can be different affordances that are more natural or less natural. Even if, at the end of the day, you're seeing a 2D image or a 2D video, your brain is perceiving it as a projection of a 3D world. So there are things you might want to do: move objects around, move the camera around.

In principle, you might be able to do this with a purely 2D representation and model, but it's just not a good fit for the problems you're asking the model to do, right? Modeling the 2D projections of a dynamic 3D world is a function that can probably be modeled better by putting a 3D representation into the heart of the model. That is going to be a better fit between the kind of representation the model works on and the kinds of tasks you want the model to do. So our bet is that by threading a little bit more 3D representation under the hood, we better enable better affordances for users.

And this also goes back to the north star for me. Why is it spatial intelligence? Why is it not flat pixel intelligence? It's because I think the arc of intelligence has to go to what Justin calls affordances. If you look at evolution, right, the arc of intelligence eventually enables animals and humans, especially humans as intelligent animals, to move around the world, interact with it, create civilization, create life, create a sandwich, whatever you do in the 3D world. And translating that into a piece of technology, that native 3D-ness is fundamentally important for the flood of possible applications, even if some of them, the surface of them, look 2D. It's innately 3D to me.

I think that's a very subtle and incredibly critical point, and so I think it's worth digging into. A good way to do this is talking about use cases. So just to level-set: we're talking about building a technology, call it a model, that can do spatial intelligence. So maybe, in the abstract, what might that look like, a little bit more concretely?

There are a couple of different kinds of things we imagine these spatially intelligent models being able to do over time. One that I'm really excited about is world generation. We're all used to something like a text-to-image generator, or we're starting to see text-to-video generators, where you put in an image or put in a video and out pops an amazing image or an amazing two-second clip. But I think you could imagine leveling this up and getting 3D worlds out.

So one thing that we could imagine spatial intelligence helping us with in the future is leveling these experiences up into 3D, where you're getting out a fully virtual, simulated, but vibrant and interactive 3D world. Maybe for gaming, maybe for virtual photography, you name it. If you get this to work, there will be a million applications.

Education. I mean, in some sense, this enables a new form of media, right? Because we already have the ability to create virtual, interactive worlds, but it costs hundreds of millions of dollars and a ton of development time.

And as a result, where do people direct this technological ability? Video games, right? Because it takes so much labor to do, the only economically viable use of that technology in its form today is games that can be sold for seventy dollars apiece to millions and millions of people, to recoup the investment.

If we had the ability to create these same virtual, interactive, vibrant 3D worlds more easily, you'd see a lot of other applications, right? Because if you bring down the cost of producing that kind of content, people are going to use it for other things. What if you could have a personalized, curated experience that's as good, as rich, as detailed as one of these triple-A video games that cost hundreds of millions of dollars to produce, but catered to this very niche thing that maybe only a couple of people would want? That's not a particular product or particular roadmap, but I think that's a vision of a new kind of media that would be enabled by spatial intelligence in the generative realm.

If I think about a world, I actually think about things that are not just scene generation. I think about things like movement and physics. And so, in the limit, is that included? And if I'm interacting with it, are there semantics? By that I mean, if I open a book, are there, like, pages, and are there words in it, and do they mean something? Do we take the experiment to its full depth, or are we talking about a static scene?

I think we'll see a progression of this technology over time. This is really hard stuff to build, so I think the static problem is a little bit easier. But in the limit, we want this to be fully dynamic, fully interactive, all the things that you just said.

I mean, that's the definition of spatial intelligence. So there is going to be a progression; we'll start with more static. But everything you've said is in the roadmap of spatial intelligence.

This is kind of in the name of the company itself. World Labs: the "world" is about building and understanding worlds. And this is actually a little bit inside baseball; I realized after we told people the name, they don't always get it.

Because in computer vision, in reconstruction and generation, we often make a distinction about the kinds of things you can do. Kind of the first level is objects, right? A microphone, a cup, a chair: these are discrete things in the world. And a lot of the ImageNet-style stuff that Fei-Fei worked on was about recognizing objects in the world. Then, leveling up to the next level beyond objects, I think of scenes.

Scenes are compositions of objects. Now we've got this recording studio with a table and microphones and people and chairs, some composition of objects. But we envision worlds as a step beyond scenes. Scenes are kind of individual things, but we want to break the boundaries: go outside the door, step away from the table, walk out the door, walk down the street, see the cars buzzing past and the leaves on the trees moving, and be able to interact with those things.

And I think that's really exciting, because, Justin mentioned the words "new media": with this technology, the boundary between the real world and the virtual, imagined world, or the augmented world, or the digital world, all becomes blurry.

The real world is 3D, right? So in the digital world, you have to have a 3D representation to even blend with the real world. You cannot have a 2D or a 1D representation and be able to interface with the real 3D world in an effective way. That's what it unlocks. So the use cases can be

quite limitless because of this. So taking any number of these cases: the one where you're just adding to the world would be more of an

augmented reality, right? Yeah. Just around the time World Labs was being formed, the Vision Pro was released by Apple, and they used the term "spatial computing."

We were almost... they almost stole it. But we're spatial intelligence. So spatial computing needs spatial intelligence; that's exactly right. We don't know what hardware form it will take, whether it'll be goggles, glasses, or contact lenses. But at that interface between the true real world and what you can do on top of it, whether it's to help you augment your capability to work on a piece of machinery and fix your car even if you're not a trained mechanic, or to just play Pokémon Go, suddenly this piece of technology is going to be the operating system, basically, for AR, VR, and mixed reality.

What does an AR device need to do? It's this thing that's always on; it's with you; it's looking out into the world. So it needs to understand the stuff that it's seeing, and maybe help you out with tasks in your daily life.

But I'm also really excited about this blend between virtual and physical. It becomes really critical: if you have the ability to understand what's around you in real time, in perfect 3D, then it actually starts to deprecate large parts of the real world as well. Like, right now, how many differently sized screens do we all own for different use cases? You've got your phone, you've got your iPad, you've got your computer monitor, you've got your TV, you've got your watch. These are all differently sized screens because they need to present information to you in different contexts and in different positions. But if you've got the ability to seamlessly blend virtual content with the physical world, it kind of deprecates the need for all of those. It just, ideally, seamlessly blends the information that you need to know in the moment with the right mechanism for giving you that information.

And another huge case for being able to blend the digital virtual world with the 3D physical world is for AI agents to be able to do things in the physical world. Humans can use these mixed reality devices to do things; like I said, I don't know how to fix a car, but if I have to, I put on these goggles or glasses, and suddenly I'm guided to do that.

But there are other types of agents, namely robots, any kind of robot, not just humanoids. Their interface, by definition, is the 3D world, but their compute, their brain, by definition, is the digital world. So what connects a robot's brain, from learning to behaving, to the real world? It has to be spatial intelligence.

So you've talked about virtual worlds, you've talked about more of an augmented reality, and now you've just talked about the purely physical world, basically, which would be robotics. For any company, that would be a very large charter, especially if you're going to get into hardware. How do you think about that? Is the idea deep tech, or is it any of these specific

application areas? We see ourselves as a deep tech company, as the platform company that provides models that can serve these different use cases.

Of these three, is there any one that you think is more natural early on, one that people can expect the company to lean into?

I think it suffices to say the devices are not totally ready.

Actually, I got my first VR headset in grad school. It's one of these transformative technology experiences: you put it on, and you're like, oh my god, this is crazy. I think a lot of people have that experience the first time they use VR. So I've been excited about the space for a long time, and I love the Vision Pro; I stayed up late to order one of the first ones the day it came out. But I think the reality is, it's just not there yet as a platform for mass-market appeal.

So, very likely, as a company we will move into a market that's more ready than that. But you know, we are a deep tech company.

And I think there can sometimes be simplicity in generality, right? We have this notion of being a deep tech company. We believe there are some underlying fundamental problems that need to be solved really well, and if solved really well, they can apply to a lot of different domains. We really view the long arc of the company as building and realizing the dreams of spatial intelligence writ large.

So this is a lot of technology to build.

It seems to me... yeah, I think it's a really hard problem. I think sometimes people who are not directly in the AI space just see AI as one undifferentiated mass of talent. Those of us who have been here longer realize that there are a lot of different kinds of talent that need to come together to build anything in AI, and in particular this.

We've talked a little bit about the data problem. We've talked a little bit about some of the algorithms that I worked on during my PhD. But there's a lot of other stuff we need to do this, too. You need really high-quality, large-scale engineering.

You need a really deep understanding of the 3D world. There are actually a lot of connections with computer graphics, because they've been attacking a lot of the same problems from the opposite direction. So when we think about team construction, we think about how we find the absolute best experts in the world at each of these different sub-domains that are necessary to build this really hard thing.

When I thought about how we would form the best founding team for World Labs, it had to start with a group of phenomenal, multidisciplinary founders. And of course Justin was natural for me. Justin, cover your ears: he's one of my best students and one of the smartest technologists I know.

But there were two other people I had known by reputation, and one of them Justin had even worked with, whom I was dying to work with. One is Ben Mildenhall; we talked about his seminal work, NeRF. The other person is Christoph Lassner, who has a great reputation in the computer graphics community. In particular, he had the foresight to work on a precursor of the Gaussian splatting representation for 3D modeling, five years before Gaussian splatting took off.

They're legends. Maybe just quickly talk about how you thought about building out the rest of the team, because again, to build this it's not just, like, work on graphics; it's systems too.

Yeah, this is what, so far, I am personally most proud of: the formidable team. I've had the privilege of working with the smartest young people in my entire career, right, from the top universities, being a professor at Stanford. But the kind of talent that we've put together here at World Labs is just phenomenal. I've never seen this concentration.

And I think the biggest differentiating element here is that we are all believers in spatial intelligence, all of the multidisciplinary talents, whether it's systems engineering, machine learning infrastructure, generative modeling, data, or graphics; all of us, whether through our personal research journey, technology journey, or even personal hobbies. And that's how we really formed our founding team. And that focus of energy and talent and hunger, I just love it.

So I know you're guided by a north star. Something about north stars is that you can't actually reach them, because they're in the sky, but they're a great guide. So how will you know when you have accomplished what you've set out to accomplish? Or is this a lifelong thing that's going to continue kind of infinitely?

First of all, there are real north stars and virtual north stars. Sometimes you can

reach a virtual north star. Like I said,

one of my north stars that I thought would take a hundred years was storytelling from images, and Justin and Andrej, in my opinion, solved it for me. So we can get to our north star. But I think for me, it's when so many people and so many businesses are using our models to unlock their needs for spatial intelligence. That's the moment I'll know we have reached a major milestone.

Actually, I don't think we're ever going to get there. I think this is such a fundamental thing. The universe is a giant, evolving four-dimensional structure, and spatial intelligence writ large is just understanding that in all of its depth and figuring out all the applications of it. So we have a particular set of ideas in mind today, but I think this journey is going to take us places that we can't even imagine right now.

The magic of good technology is that technology opens up more possibilities and unknowns. So we will keep pushing, and the possibilities will keep expanding.

Thank you, Justin. Thank you, Fei-Fei. This was fantastic.

Thank you, Martin. Thank you.

All right, that is all for today. If you did make it this far, first of all, thank you. We put a lot of thought into each of these episodes, whether it's the guests, the calendar Tetris, the cycles with our amazing editor Tommy until the music is just right.

So if you like what we've put together, consider dropping us a line at ratethispodcast.com/a16z, and let us know what your favorite episode is. It'll make my day, and I'm sure Tommy's too. We'll catch you on the flip side.