Teaching AI to Understand the Physical World, with Dr. Fei-Fei Li of World Labs

2025/6/5

No Priors: Artificial Intelligence | Technology | Startups

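Fei-Fei Li: To me, spatial intelligence is the ability to understand, reason about, interact with, and generate 3D worlds. Our world is fundamentally 3D, no matter how we project it. If there is a true 3D representation, many things become easier, such as design, creation, navigation, simulation, or experiencing AR/VR. Both humans and animals have spatial intelligence, and it is closely tied to the course of evolution. Spatial intelligence is so fundamental that AI is incomplete without it. I think that from a neural and cognitive science point of view, spatial intelligence is a hard problem that evolution had to solve for animals. Animals had to evolve the ability to collect light and then use that light to reconstruct a 3D world in their minds so that they can navigate and act. Humans are the most capable animals at manipulation, and all of this is tied to spatial intelligence. Even for humans, spatial intelligence is not a fully solved problem; for example, closing your eyes and building a 3D model of your surroundings is not easy. If we could build complex 3D models more easily at our fingertips, with more fluid interactivity and editability, it would open up a whole new world for people.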

Hi, listeners, and welcome back to No Priors. Today's guest is Dr. Fei-Fei Li, a pioneer in computer vision and deep learning. She created ImageNet, the groundbreaking data set that helped spark the deep learning revolution. Fei-Fei is a Stanford professor and the co-director of the Stanford Institute for Human-Centered AI.

She's also led AI at Google Cloud, advised international policymakers, and recently co-founded World Labs, a company dedicated to developing spatially intelligent AI. Fei-Fei, thank you for joining us today.

Well, thanks for inviting me. This is going to be fun.

So you have made extraordinary contributions to science and policy over the past two decades. I'll start with the biggest question, like why start a company now?

Because in my heart, I want to build. I see this as such a critical and fun and exciting moment to build some extraordinary technology that everybody can use. And I believe so much in spatial intelligence and the kind of 3D world models that can empower so many people as well as so many use cases. And I think it's going to be really exciting. And I can do that with an extraordinarily brilliant group of young technologists.

I want to come back to, you know, the people you're working with, because I know some of your co-founders and was, you know, trying to convince them desperately to start a company a while back. And then they were like, oh, no, we have a bigger mission now with Fei-Fei. What is spatial intelligence? Can you define it for a broader audience?

Spatial intelligence to me is the ability to understand, reason about, interact with, and generate 3D worlds, because our world fundamentally, no matter how you say we can project it, is 3D. And it's 3D because physically it's 3D. And digitally, if there is a true 3D representation, then we can make a lot of things happen more easily, whether it's designing or creation or navigation or simulation or the experiencing of AR, VR. All this, to me, is part of spatial intelligence. And again, I think what really excites me is humans have spatial intelligence. It's part of our core intelligent capabilities. Animals have spatial intelligence. The entire journey of evolution also is deeply intertwined with the evolution of spatial intelligence. So it's so fundamental. Without spatial intelligence, AI would be incomplete.

How does that translate into what you're doing with your company? Or is there anything you can share in terms of what that means relative to what you're building?

Yeah, so we're cracking one of the hardest problems in AI, which is actually making world models that are fundamentally 3D. Because once you can crack that problem, you can unlock a lot of spatial intelligence problems. So we are the first company we know of that is solving the 3D generation foundation model problem.

I have many questions, but since you are, you know, describing this first as, you know, 3D's criticality to just sort of understanding the world: does that imply you feel that the world models that, you know, World Labs will create, or others in academia or in companies will create, will someday be like, you know, realistically accurate, like represent physics and understand the world, so that we can do many more things with them?

Yeah, it should. It should be realistically accurate or plausible. So you can create a fantastical world, but it should be plausible, because the geometry and the physics of it need to be plausible. And that is fundamental to spatial intelligence.

Does that imply you have a particular point of view from like a neuroscience perspective of like, you know, how fundamental visual... I mean, you've always been a leader in computer vision, right? But in how important visual intelligence is versus, let's say, like large language models and textual intelligence.

I actually do. I think from a neural and cognitive science point of view, spatial intelligence is a really hard problem that evolution has to solve for animals.

And what's really interesting is I think animals have solved it to an extent, but not fully solved it. It's one of the hardest problems, because what is the problem an animal has to solve? Animals have to evolve the capability of collecting light in something which we mostly call eyes.

And then with that collection of light, they have to reconstruct a 3D world in their mind somehow so that they can navigate and they can do things. And of course, they can interact. For humans, we're the most capable animal in terms of manipulation. We can do a lot of things. And all this is spatial intelligence. To me, that's just rooted in our intelligence.

What is interesting is it's not a fully solved problem, even in animals. For example, for humans, right? If I ask you to close your eyes right now and draw out or build a 3D model of the environment around you, it's not that easy. We don't have that much capability to generate an extremely complicated 3D model until we get trained. You know, there are some of us, whether they're architects or designers or just people with a lot of training and a lot of talent. And that's a hard thing to do. And imagine you could do it at your fingertips much more easily and allow much more fluid interactivity and editability. That would just be a whole different world for people, no pun intended.

Are there other big areas like spatial intelligence that you feel haven't been as developed as they could be from a model perspective, or other sort of missing gaps that you think, in general, as we build this sort of AI future, we should focus on over time or people should build out? I was just wondering, in addition to sort of 3D and world generation and other big problems like that, because it feels like there are a few big things that we've solved for over time and other things we're working on.

We've sort of solved language. I would say language is solved to a huge extent.

And 3D to me is as critical and difficult as language. So what else is there to solve? I mean, the entire space of emotional intelligence is something that I don't even know how to begin to solve.

I know a lot of people who haven't solved it. That's when AGI is achieved.

Yeah, so that's another one. And I can tell you the training data for that is not going to come from Silicon Valley people.

Don't underestimate Silicon Valley. I'll put myself in this bucket, but I think we probably need a broader set of people.

Yeah, no, that I agree. But these are the three big buckets. To be honest, I don't know. What do you think, Elad and Sarah?

I think it depends a lot on what you encapsulate in each model. So I agree with your framework in terms of those three. And then certain things like the spatial intelligence, I'm assuming, also delve into different types of physics simulation and simulations of the world. Those are big areas that I think a lot of people aren't working on that I think are really interesting or important.

And there's sort of the macro and the micro scale of that. The micro scale eventually becomes material sciences and other very different types of things from what you're talking about, where it's more molecular modeling or, yeah.

Right. And also some of this goes outside the current definition of AI, which I do think they'll be empowered by. Of course, there's robotics, but robotics is very much a system integration problem as much as a, you know, even if you look at animals, it's not just the compute in the brain per se, right?

Yeah, a lot of these things seem to be much more distributed in terms of spatial intelligence relative to specific systems that animals have. And in some cases, it's, to your point, not as centralized as one would think. So it's very interesting to start thinking in terms of those models of more distributed intelligence across an organism versus the CNS. But yeah, I think it's very interesting stuff.

You've also done work in this field, Fei-Fei, of robotics and physical intelligence. I think of the data hierarchy for robotics foundation models and actuation as: people want to, of course, use video, right? Because that is what is available to us. There's a big question on like simulation and how much you can get from that today. Perhaps people do not see the future of like the quality and the physics that are going to be available to us.

And then there's close to embodied, like different forms of tele-op and then like embodied data collection. Is that the hierarchy you have in your mind or do you think people underestimate simulation and world models for the future?

Yeah, great question. First of all, like you said, I do work in robotics, especially in my lab at Stanford. I have no doubt that humanity will move into an age where we cohabit with robots. And also, the word robot is not humanoid per se. Robots take all kinds of forms and shapes. Actually, a few years ago, my lab wrote a really fun paper about morphological intelligence, where the morphology of a robot or an agent actually can change by optimizing for the tasks they're trying to achieve. So we should be a little more imaginative than just humanoids.

Having said that, how to train robots: you mentioned this whole data, some people call it data pyramids or data cakes or whatever. I agree. I think it's going to be a hybrid of many different forms of data. I also think simulation is important and underrated. Actually, it's not underrated by a lot of experts and people in the field. If you look at a lot of robotics companies, they are working on simulation and synthetic data. I also think we have to be aware that unlike language models or even unlike spatial intelligence foundation models, robotics is a highly multimodal system, and I think what is truly underappreciated, in my opinion, is haptics. There's so much, especially if we want to do manipulation, not just navigation. I think haptics data and the ability to really integrate haptics into vision and perception and spatial data is absolutely critical.

One thing that you said that I thought was really interesting is how many different, what are the different morphological forms that a robot may adopt? And there's sort of two counter arguments people make in terms of the potential future.

One argument is that from a supply chain perspective and managing builds and scale of manufacturing, you're going to have many fewer form factors. And the other argument is the economic value of specialization is very high.

And therefore, there'll be, you know, thousands and thousands of different form factors as we move to sort of a robot-driven future. Do you have a point of view on sort of where we're likely to land between those two viewpoints?

I think we're going to gradient-descend toward optimization of productivity and efficiency. My hypothesis is that the requirements of different tasks are so vast that having very few forms or sticking with one form is impossibly energy inefficient. And a lot of tasks can be done and should be done by much more energy-efficient form factors. Just an extreme and trivial example: if we put robots underwater, they should not be in the shape of humans. They better be in the shape of fish, right? Just think about energy efficiency. And the same with flying. I don't think human form is... Our airplanes are becoming more and more robots. And so I...

I do think there's going to be diversity.

Robotics is one potential application for the future. You're a scientist first, but also, you know, did the Twitter board, involved in startups. What are the near-term commercial applications that you can imagine for generating 3D worlds?

I believe creativity is a vastly exciting area where humans can be superpowered by AI and by spatial intelligence. And here I draw an analogy with software engineering. If you look at today's success of LLMs in software engineering, including applications like Cursor and Windsurf and all that, what you see is a lot of collaboration between AI and humans. And then the collaboration comes in different levels of skill sets and all that. And I think creativity will be similar, in that whether we're talking about designers, 3D artists, VFX artists, or even marketing talents and game developers, there's so much need in designing and creating 3D space. And this is fundamentally such a hard problem, even for the trained, skilled people, that having a collaborator will be extremely fun if we do it right.

And so I see creativity as an area that is really exciting. I also do think that a lot of what we're waiting for, for metaverse or AR, VR, is content creation. I understand hardware itself needs to continue to evolve, but I also think, for software, we're looking for content creation, and that lends itself so naturally to 3D modeling and 3D or generative spatial models. And that's another interesting area to look into.

Do you have a strong point of view on whether or not world models are like an interesting answer to scalable RL for like more generalizable agents?

I actually do think this is, like I said, AI is not complete without spatial intelligence, because humans interact in 3D worlds, and in the digital world we need all kinds of interaction. You know, take design as an example. It's a deeply, you know, it has...

When we are thinking about design, there's so much we are optimizing for in our mind's eye, whether it's beauty or efficiency or optimization or whatever it is. And that lends itself pretty naturally to RL settings.

What are the biggest challenges in, I guess, trying to go down this path of designing and training world models? I imagine one is, like, you worked on images, you worked on video, but we have images and we have video and we don't have lots of, you know, 3D worlds, like in a format I assume you're building.

Yeah, data is absolutely a challenge. You're totally right about that. You know, to create world models, 3D foundation models, we require more and more sophisticated data engineering, data acquisition, data processing and data synthesis. So, yeah, I am envious of my NLP and LLM colleagues, because the data is so abundant on the internet and we don't necessarily have that luxury. So that's definitely one challenge.

Another challenge is that 3D, this is kind of ironic, right? Every one of us uses 3D every day. Like in so many settings, basically you open your eyes and the whole life that you experience is 3D, even when we type on the computer or stare at a screen all the time. Yet it's still not as easy a form factor to deliver into the hands of people compared to language. Language is just so easy. And it's also a very active form; it's not a passive consumption of viewing. Nobody wakes up and says, I'm just going to sit here and watch 3D, you know? So that creates challenges for productization and how to do it in the right way.

Were you ever like a Second Life player?

I'm not a gamer, but my kids love Minecraft.

I was going to ask you if there was like a world that you want to experience or imagine.

That's a great question, Sarah. You know, I would love to see worlds I don't see. For example, like zooming in and in and into like microscopic worlds or, you know, going to the inside of an engine, you know, knowing how the actual engine is. Of course, I know theoretically how it works, but seeing it with my own eyes, experiencing it, or even, you might laugh at this, I want to be inside a dishwasher and just experience what that is. All this can be done in a virtual way if we manage to create, you know, world models of anything.

Okay. I think Elad and I both want to talk a little bit about your past career and maybe some insights for anyone doing research or trying to, you know, have an impact within AI. Right before this, I asked Andrej Karpathy what I should ask you. And he said, you know, Fei-Fei is really magic about ambition and thinking about data. You should ask her about her PhD and the creation of that 101 data set with Pietro, because it's instructive. So I have to ask you about that.

You know, first of all, I have to say it's always really the greatest thing when your student is more well-known and achieving so much more than you. It makes me so proud. So very proud of Andrej. I'm surprised that he remembers my PhD work. So yes, it's true. Well, gosh, it goes back to 2003-ish, and the world was just barely scratching the surface of the internet, and data was not much of a thing. But doing computer vision, my PhD work was really trying to get object recognition to work. That's the problem of calling out cats and dogs and microwaves and chairs and all that when you're presented with a picture.

And we were beginning to hypothesize that data matters, but we had no idea. There was no scaling law. We had no idea, you know, how far data can go. All we wanted is, if we have a machine learning algorithm, whether it's a neural network or a Bayes net, which at that time was very popular, or a support vector machine, we need some data to train. And there was no data to train. And as a PhD student, you want to, you know, graduate.

And Pietro was like, well, Fei-Fei, curate a data set. And I was thinking, yeah, I do need to curate a data set because every data set out there is so tiny. I'm just not convinced. And Pietro and I were just talking, you know, 15 different things or 30 different things. And then, God forbid, the PhD advisor said the three-digit number, 100. And I was like, you know, that's a lot of work, but deep in my heart, I knew he was right. From a mathematical point of view, to push the model to generalize, we need enough data at least.

So, you know, I did write about this process in my book, The Worlds I See, that I stumbled upon a dictionary somehow, and it really was for my own English study. The dictionary, I think it's the Webster dictionary, if I'm not wrong, just kind of randomly has visual depictions of some words. I don't even know what rule they follow, to be honest. Some are flowers, some are bicycles, some are dogs. I was like, okay, this is actually, you can call it a cheat or a tool. I grabbed 101 of those words. And that really made my PhD advisor kind of chuckle because he's like, ah, yeah, you just want to do one more than I asked for to, you know, dare me. So that's what I did.

And I gotta say that I still remember I downloaded or, you know, tried, you know, from Google, and Google was so new at that point. And the Google image search was so terrible at that point, you know, compared to today. And I had to do so much cleaning. At some point I got so desperate, I just asked my mom to do the image cleaning, because I wrote a little interface on the computer. She doesn't know computers, but at least she knows click, click. So she helped me to do some of that.

I mean, you've had one of the most storied careers in AI. And to your point, many of your students have similarly gone on to do really great things across the field, across industry, across the world.

What are two or three moments that you think of when you think back on your career to date? And obviously there's still a lot of career to come, but I'm just sort of curious. I mean, obviously there's a lot of things that you did in terms of sort of image and visual recognition related systems and all sorts, but I'm just sort of curious, like when you think of the last 20 years, what stands out the most? Just given everything that you've done.

Oh, thank you for asking that question. Of course, ImageNet is one of those. ImageNet consists of multiple moments, from the early struggles and being told I will not get tenure, to actually realizing Amazon Mechanical Turk comes to the rescue, to the moment of AlexNet winning. And also, a couple of years ago, I was at an event in Toronto with Geoff Hinton, and he said publicly how that was so defining. And he was almost a little bit apologetic that ImageNet was not as recognized as neural networks. So that journey is very validating. And for scientists, the validation is not about recognition or awards. It's that you made a difference, like that conjecture that no one believed in, that hypothesis that no one believed in, we were able to make it happen. So that's one thread.

Just to make sure, for any people from the business world that are not familiar with it, ImageNet is a large-scale data set with millions of labeled images across thousands of categories, not just 101, right?

15 million labeled images.

15 million labeled images. Thank you, Fei-Fei. That, you know, led to amazing breakthroughs in deep learning, in particular AlexNet, and lots of progress in the field of computer vision overall.

Yeah, it drove a lot of machine vision forward. And I actually remember in 2016 or 2017, I used to show a slide, which was the history of AI or, you know, back then it was CNNs and RNNs and just GANs were, you know, kind of going. And I had ImageNet and AlexNet as like one of the seminal moments of, you know, this very small number of events that really defined AI progress. And obviously now we have transformers as part of that and maybe diffusion models or something, but it was such a big breakthrough.

Yeah, thank you. Another moment I'm very proud of was actually Andrej and also Justin Johnson and their dissertations. It's where, in my opinion, for the first time language and images converged, by captioning and writing stories of the visual world. It was significant for me for two reasons. One is that I literally thought, I kid you not, at the end of my PhD, I thought if I can live to 100 years old, that was the problem we might be able to solve, which is storytelling of pictures. So I entered my career, like as a first-year assistant professor, thinking, okay, I'm going to do ImageNet to solve object recognition, and then I'm going to spend the rest of my entire career solving this problem of storytelling.

And then by the time Andrej and then, a little later, Justin Johnson entered my lab, that was around 2013, 2014, the beginning of deep learning. And then suddenly the combination of a sequential model, which at that point was LSTM, not transformer models, and CNN just blasted open image captioning. And Andrej's and my work was the first, together with Google's, that was out of the door. And that was really, to me, it made me so proud. I almost had a crisis, which is like, what am I going to do for the rest of my 70 years or 65 years? So that was really exciting, how fast the field has, you know, evolved.

Can I ask you one more question about this? Just because you have, you know, made this amazing progress, like very efficiently, right? Like you and I have talked offline before about how you feel it's really important for there to be, you know, moonshots and creativity in AI research beyond like very large funded corporate labs, let's say. And, you know, you pointed to several moments that come from like creativity and research in academia. What advice do you have for people about whether or not there's still opportunity for that? Or, you know, is it all just $10 billion training runs from here?

My singular advice, and I still say that in my company, in my lab, is be fearless. I think scientists and technologists and entrepreneurs have to be fearless. You know, eventually you have to figure out, do you need $10 billion runs? Or then you come to Sarah and ask for funding.

Probably Elad, but both.

Yeah. Or you have to figure out, you know, I don't know, data. Sometimes fearless is this very interesting position where you're somewhat delusional and crazy, but somewhat just rationally bold. And it kind of is in between, because if you're too rational, you're not identifying problems that are big enough. But if you're completely crazy, then, I don't know, there's many things that can go wrong. So be fearless, be courageous. To me, that is, you know, even as old as I am, that's how I feel. I started my startup, World Labs. It's amazing. I want to be fearless and solve this problem of spatial intelligence.

As part of problem solving, you've worked with some of the best AI researchers in the world over time and best engineers.

How do you think about that in the context of your company? Like what sorts of people are you trying to hire? Are there open roles currently? And it's already an amazing team. I'm just curious, like what sorts of folks you want to add and how you're thinking about that over time?

Yes, we have open roles and we would love to hire the best engineers as well as product thinkers at this point for our company. So if you're an engineer or AI researcher or product talent out there, passionate about joining the most talented team and solving this problem, please join us. So who do we hire? First of all, we really do hire for diversity of thinking. And this is where, you know, you call us an AI company, but if you look under the hood, we've got computer graphics experts, we've got computer vision experts, we've got data experts, we've got, you know, generative AI experts, we've got machine learning infra experts, we've got optimization, we've got... So it's actually really important to hire a diverse group of really talented people, because a problem as hard as spatial intelligence is not a homogeneous problem. Like it takes talents of all kinds of backgrounds to solve it. And then I also just, you know, like I look for fearlessness. Like, you know, we all have.

How do you do that? Like, how do you identify if somebody has fearlessness in their background or in their thinking processes?

It's in their background. You talk to them. You can sense someone is fearless. You know, you can sense what drives them. You know, you can sense the questions they ask. If they start asking you a lot of things about fear, like I don't know how to get this done. And I mean, of course you have to ask those questions because you want to get it done. But if you sense that it comes from the point of view of being scared of solving that, then that's not fearlessness. But those fearless people, they are creative, they're ambitious, they're not afraid of the uncertainty or the unknown. And I really love that.

Well, I think Elad and I, you know, we try to make a business of doing business with fearless people and hopefully those that are technically creative. One last broader question for you, because I think an important part of your work has been also thinking about how to bring more people into AI, you know, co-directing the Stanford Institute for Human-Centered Artificial Intelligence. If you picture, you know, not to use a pun on the book, but if you picture the world like several years out from your last set of predictions, what's your most optimistic view of what human-centered AI looks like?

Yeah, thanks for asking. In fact, that is another point of my career I feel very proud of, the founding of the Human-Centered AI Institute, HAI, and also the continued movement towards that way of thinking.

I think I want to build a world where AI collaborates with and superpowers people. I still believe our world, our human world, needs to be human centered, you know, where love, relationships, just prosperity across all communities, justice and all, these are really important values. And I don't think any piece of machinery, whether it's AI or airplanes or biotech, should take those away. But with those critical values in mind, having AI to superpower us is really, really important, because there's so many unsolved problems. One application area I had worked on is healthcare, for example, at Stanford, right? If you look at healthcare, from drug discovery to cure diseases, to diagnosis that can reach all people in the world, to treatment that can be accessible to all people in the world, to the whole healthcare delivery, how to make aging better, how to take care of chronic diseases, how to deal with mental health. In all of this, we do not have an issue of excessive humans or anything. We're lacking help.

You know, we are lacking scientific discovery. We're lacking diagnosis. We're lacking precision medicine. We're lacking safer and more effective ways of healthcare delivery and aging help and all that. And that's what I believe. I think AI is a tool to help people.

Yeah, I think Elad and I are collectively invested in a series of companies that I hope will be useful here, from Abridge to OpenEvidence to Latent. But as you said, there's a huge spectrum of problems. And honestly, I've been less optimistic about the adoption of technology generally in health care for the last decade, 15 years, but it does feel like this time it's different. And actually, it's just massively net good here.

Yeah, I actually started a digital health company before this. And my hope is finally a lot of the things that people have been talking about for decades will come to fruition. And it seems like AI is a great delivery mechanism for that. So totally.

Well, thank you so much, Fei-Fei. It's fantastic. This has been inspiring and great to hear a little bit more about World Labs as well.

Thank you. Thank you, Elad. Thank you, Sarah.

Find us on Twitter at NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.

