My entire career is going after problems that are just so hard, bordering on delusional. To me, AGI will not be complete without spatial intelligence, and I want to solve that problem. I just love being an entrepreneur. Forget about what you have done in the past. Forget about what others think of you. Just hunker down and build. That is my comfort zone.
So I'm super excited here to have Dr. Fei-Fei Li. She has such a long career in AI. I'm sure a lot of you know her, right? Raise your hand. I know you too. She's been named the godmother of AI. One of the first projects that you created was ImageNet in 2009, 16 years ago. Oh my god.
Don't remind me of that. Now it has over 80,000 citations, and it really kicked off one of the legs of the stool for AI, which is the data problem. Tell us about how that project came about. It was pretty pioneering work back then.
Yeah, well, first of all, Diana and Gary and everybody, thanks for inviting me here. I'm so excited to be here because I feel like I'm just one of you. I'm also an entrepreneur right now. I just started a small company, so very excited to be here. ImageNet was...
Yeah, you're right. We actually conceived that almost 18 years ago. Time really flies. I was a first year assistant professor at Princeton. Oh, wow. Hi. Hi, Tigers.
Yeah, and the world of AI and machine learning was so different at that time. There was very little data. Algorithms, at least in computer vision, did not work. There was no industry. As far as the public was concerned, the word AI didn't exist.
But there was still a group of us, starting from the founding fathers of AI, right? John McCarthy, and then going through people like Geoff Hinton. I think we just had an AI dream. We really, really wanted to make machines think and work. And with that dream, my own personal dream was to make machines see.
Because seeing is such a cornerstone of intelligence. Visual intelligence is not just perceiving; it's really understanding the world and doing things in the world. So I was obsessed with the problem of making machines see. And as I was obsessively developing machine learning algorithms (at that time we did try neural networks, but they didn't work, so we pivoted to Bayes nets, to support vector machines, whatever it was), one problem always haunted me, and it was the problem of generalization. If you work in machine learning, you have to respect that generalization is the core mathematical foundation, or goal, of machine learning. And in order to generalize, these algorithms need data.
Yet no one had data at that time in computer vision. And I was part of the first generation of grad students who started to dabble in data, because I was part of the first generation of graduate students who saw the internet take off. So fast forward:
Around 2007-ish, my students and I decided that we had to take a bold bet: that there needed to be a paradigm shift in machine learning, and that paradigm shift had to be led by data-driven methods. And there was no data, so we said, okay, let's go to the internet, download a billion images (that's the highest number we could get from the internet), create the entire world's visual taxonomy, and use that to train and benchmark machine learning algorithms. And that was why ImageNet was conceived and
came to life. And it took a while until there were algorithms that were promising. It wasn't until 2012 that AlexNet came out, and that was the second part of the equation for getting to AI: the compute, the algorithms, and throwing enough of both at the data. Tell us about that moment when you started to see that you had seeded it with data and the community started to figure more things out for AI.
Right. So in 2009, we published this tiny little CVPR poster. Between 2009 and 2012, when AlexNet arrived, there were three years where we really believed that data would drive AI, but we had very little signal as to whether it was working. So we did a couple of things. One is we open sourced it. We believed from the get-go that we had to open source this to the entire research community for everybody to work on.
The other thing we did is we created a challenge, because we wanted the whole world's smartest students and researchers to work on this problem. That was what we called the ImageNet Challenge. Every year we released a test set; the whole ImageNet dataset was there for training, but we held out the test set and openly invited everybody to participate.
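As a quick aside for readers of this transcript who haven't worked with classification benchmarks: challenges in the ImageNet style are commonly scored with a top-5 classification error, and a minimal sketch of that kind of metric might look like the following. The function name and the toy predictions here are purely illustrative assumptions, not the official ILSVRC evaluation code.

```python
# Minimal sketch of a top-5 classification error metric of the kind used to
# score ImageNet-style challenges. Class IDs and predictions are made up.

def top5_error(predictions, ground_truth):
    """predictions: one ranked list of class IDs per test image (best guess first);
    ground_truth: the true class ID for each test image, in the same order."""
    misses = sum(
        1 for ranked, truth in zip(predictions, ground_truth)
        if truth not in ranked[:5]  # a miss if the true class is not among the top 5 guesses
    )
    return misses / len(ground_truth)

# Toy usage: two test images, one classified correctly within its top 5, one not.
preds = [[3, 17, 42, 8, 99], [7, 1, 2, 5, 6]]
truth = [42, 88]
print(top5_error(preds, truth))  # 0.5, i.e. a 50% error rate on this tiny set
```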
And the first couple of years were really about setting the baseline. The performance was around a 30% error rate; it wasn't completely random, but it wasn't that great. But in the third year, 2012 (I wrote about this in a book I published, but I still remember it), it was around the end of summer, and we were taking all the results of the ImageNet Challenge and running them on our servers. I remember it was late at night. I was home when I got a ping from my graduate student, who said, "We got a result that really, really stands out, and you should take a look."
And we looked into it. It was a convolutional neural network. It wasn't called AlexNet at that time; that team, Geoff Hinton's team, was called SuperVision, a very clever play on the word "super" as well as "supervised learning." And we looked at what SuperVision did. It was an old algorithm: convolutional neural networks had been published in the 1980s. There were a couple of tweaks to the algorithm, but it was pretty surprising at the beginning for us to see such a step change. And of course, the rest is history, as you all know. We presented this at the ImageNet Challenge workshop at that year's ECCV in Florence, Italy. Alex Krizhevsky came, and many people came; I remember Yann LeCun also came. And now the world knows this moment as the ImageNet Challenge, AlexNet moment.
I do want to say that it's not just the convolutional neural network. It was also the first time that two GPUs were put together by Alex and his team and used for the computation of deep learning. So it was really the first moment of data, GPUs, and neural networks coming together. Now, following this arc of intelligence for computer vision: ImageNet was really the seed for solving object recognition. Then right after that, AI got to the point where it could handle scenes, right? Because you did a lot of work with your students, like Andrej Karpathy, on being able to describe scenes. Tell us about that transition from objects to scenes.
Yeah, so ImageNet was solving the problem of you're presented with an image and then you call out objects. There's a cat, there's a chair and all that. That's a fundamental problem in visual recognition.
But ever since I was a graduate student entering the field of AI, I had a dream. I thought of it as a hundred-year dream: the storytelling of the visual world. When humans open their eyes (imagine you just open your eyes in this room), you don't just see person, person, person, chair, chair, chair. You actually see a scene, with a screen, a stage, a crowd, cameras; you can describe the entire scene. That's a human ability that is at the foundation of visual intelligence, and it's so critical to our everyday life. So I really thought that problem would take my entire life. When I finished as a graduate student, I literally told myself: if by my deathbed I have created an algorithm that can tell the story of a scene, I've succeeded. That was how I thought my career would go. Then the AlexNet moment came, deep learning took off. And then when Andrej, and later Justin Johnson,
entered my lab, we started to see signals that natural language and vision were beginning to collide. And then Andrej and I proposed this problem of captioning images, of storytelling. And long story short, around 2015, Andrej and I published a series of papers that were among the first, along with a couple of concurrent papers, to make a computer that literally captioned an image. I almost felt like,
What am I going to do with my life? That was my lifelong goal. It was such an incredible moment for both of us. And last year, I gave a TED Talk, and I actually used something that Andrej
tweeted a couple of years ago, around the time he finished the image captioning work that was pretty much his dissertation. I actually joked with him. I said, "Hey, Andrej, why don't we do the reverse? Take a sentence and generate an image." And of course, he knew I was joking, and he said, "Ha ha, I'm out of here." The world was just not ready.
But now fast forward, now we all know generative AI. Now we can take a sentence and generate beautiful pictures. So the moral of the story is AI has seen incredible growth. And personally, I feel I'm the luckiest person in the world because my entire career started at the very beginning of
the end of the AI winter, the beginning of AI taking off, and so much of my own work, my own career, has been part of this change, or helped with this change. So I feel so fortunate and lucky, and in a way proud. And I think the wildest thing is that even after achieving your lifelong dream of describing scenes, even generating them with diffusion models, you actually dream even bigger, because the whole arc of computer vision went from objects to scenes and now to this concept of worlds. And you actually decided to move from academia, being a professor, to now being founder and CEO of World Labs. Tell us about what this concept of a world is. It's even harder than scenes and objects. Yeah, it is. It is kind of wild.
So, of course, you all know the past; it's really hard to summarize the past five or six years. For me, we're living in such a civilizational moment in this technology's progress, right? In computer vision, as a computer vision scientist, we're seeing this incredible growth, from ImageNet to image captioning to image generation using some of the diffusion techniques. While this is happening in a very exciting way, we also have another extremely exciting thread, which is language, which is LLMs. In November 2022, ChatGPT blasted open the door of truly working generative models that can
basically pass the Turing test and all that. So this became very inspirational, even for someone as old as me, to really think audaciously about what's next.
And I have a habit as a computer vision scientist: a lot of my inspiration actually comes from evolution as well as brain science. In many moments of my career when I'm looking for the next North Star problem to solve, I ask myself what evolution has done or what brain development has done. And there's something that's really important to notice, to appreciate. The development of human language in evolution took, if you're super generous, let's just say about 300,000 to 500,000 years; less than a million years.
That's how long evolution took to develop human language. And humans are pretty much the only animals that have sophisticated language. We can argue about animal language, but language in its totality, as a tool of communication, reasoning, and abstraction, is really a human thing. So that took less than even half a million years. But think about vision.
Think about the capability of understanding the 3D world, figuring out what to do in this 3D world, navigate the 3D world, interact with the 3D world, comprehend the 3D world, communicate the 3D world. That journey took evolution 540 million years.
The first trilobites developed a sense of vision underwater 540 million years ago, and since then, vision was really what set off this evolutionary arms race. Before vision, animals were simple; for the half billion years before vision, there were just simple animals. But over the next half billion years, those 540 million years, because of the capability of seeing the world and understanding the world, an evolutionary arms race began, and animal intelligence kept escalating as animals raced each other.
So for me, solving the problem of spatial intelligence to understand the 3D world, to generate the 3D world, to reason about the 3D world, to do things in the 3D world is a fundamental problem of AI. To me, AGI will not be complete without spatial intelligence. And I want to solve that problem.
And that involves creating world models: world models that go beyond flat pixels, world models that go beyond language, world models that truly capture the 3D structure and spatial intelligence of the world, along with language.
The luckiest thing in my life is no matter how old I am, I always get to work with the best young people. So I founded a company with three incredible young but world-class technologists, Justin Johnson, Ben Mildenhall, and Christoph Lassner.
And we are just going to try to solve, in my opinion, the hardest problem in AI right now.
Which is incredible talent. I mean, Chris was the creator of Pulsar, which was the initial seed before Gaussian splats; there was a lot of differentiable rendering there. There's Justin Johnson, your former student, who really has this super systems-engineering mind and got neural style transfer running in real time. Then you've got Ben, who was the first author of the NeRF paper. So this is a super crack team. And you need such a crack team, because, as we were chatting about a bit, vision is actually harder than LLMs to some extent. Maybe this is a controversial thing to say, but LLMs are basically 1D, right? Whereas you're talking about understanding a lot of 3D structure. Why is this so hard, and why is it still behind language research?
You know, I really appreciate, Diana, that you emphasize how hard our problem is. Yeah, so language is fundamentally 1D, right? Syllables come in sequence. I mean, this is why sequence-to-sequence modeling is such a classic. There's something else about language that people don't appreciate: language is purely generative.
There's no language in nature. You don't touch language, you don't see language. Language literally comes out of everybody's head and that's a purely generative signal.
Of course, you can put it on a piece of paper and it's there, but the generation, the construction, the utility of language is very, very generative. The world is far more complex than that. First of all, the real world is 3D, and if you add time, it's 4D. But let's confine ourselves to space: it's fundamentally 3D. So that by itself is a combinatorially much harder problem.
Second, the sensing, the reception of the visual world is a projection. Whether it's your eye, your retina, or a camera, it's always collapsing 3D to 2D. And you have to appreciate how hard that is; it's mathematically ill-posed. This is why humans and animals have multiple sensors, and even then you still have to solve that problem.
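To make "mathematically ill-posed" concrete, here is a minimal pinhole-projection sketch; the focal length and the two example points are arbitrary made-up values, and this is an illustration of the general geometry, not anything specific to World Labs. Two distinct 3D points that lie on the same viewing ray land on exactly the same 2D pixel, so a single image on its own cannot recover depth without extra sensors or assumptions.

```python
# Illustrative sketch of why inverting a camera projection is ill-posed:
# an ideal pinhole camera maps a 3D point (X, Y, Z) to the 2D pixel
# (f*X/Z, f*Y/Z), so every point along the same viewing ray collapses
# to the same pixel and the depth Z is lost.

def project(point, focal_length=1.0):
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)

near_point = (1.0, 2.0, 4.0)   # a point 4 units in front of the camera
far_point = (2.5, 5.0, 10.0)   # a different point, 10 units away, on the same ray

print(project(near_point))  # (0.25, 0.5)
print(project(far_point))   # (0.25, 0.5) -> identical pixel: 2D alone cannot tell them apart
```

That lost degree of freedom is what stereo vision, multiple sensors, or learned priors have to put back, which is part of what makes the reconstruction side of spatial intelligence hard.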
And third, the world is not purely generative. Yes, we can generate virtual 3D worlds, and they still have to obey physics and all that, but there is also a real world out there. You are suddenly dialing between generation and reconstruction in a very fluid way.
And the user behavior, the utility, the use cases are very different. If you dial all the way to generation, we can talk about gaming and metaverse and all that. If you dial all the way to real world, we're talking about robotics and all that. But all this is on the continuum of world modeling and spatial intelligence. And of course, the elephant in the room is
There's a lot of data on the internet for language. Where is the data for spatial intelligence? It's all in our heads, of course, but it's not as easily accessible as language. So these are the reasons it's so hard. But frankly, that excites me, because if it were easy, somebody else would have solved it.
My entire career is going after problems that are just so hard, bordering on delusional. And I think this is the delusional problem. Thank you for supporting that. And even thinking about this from first principles, the human brain devotes a lot more of its cortex and neurons to processing visual data than to language.
How does that translate into model architectures that are very different from LLMs, from what you're finding out? Yeah, that's actually a really good question. I mean, there are still different schools of thought out there, right? A lot of what we see in LLMs is really riding the scaling law all the way to a happy ending; you can almost just brute-force self-supervision all the way. Constructing world models might be a little more nuanced. The world is more structured, and there might be signals we need to use to guide it. You can call them priors, you can call it supervision in your data, whatever it is.
I think these are some of the open questions we have to solve, but you're right. And also, if you think about humans, first of all, we don't even have all the answers to human perception, right? How 3D works in human vision is not a solved problem. We know mechanically that the two eyes have to triangulate information, but even after that, the mathematical model isn't that great, and humans are not that great as 3D animals. So there is a lot still to be answered. At World Labs, I'm really counting on one thing: that we have the smartest people in the pixel world to solve this. Is it fair to say that what you're building at World Labs is these
whole new foundation models where the outputs are 3D worlds? What are some of the applications you're envisioning? Because I think you listed everything from perception to generation, and there's always this tension between generative and discriminative models. So what would these 3D worlds do?
Yeah, so I'm not going to be able to talk too much about the details of World Labs per se, but in terms of spatial intelligence, that's also what excites me. Just like language, the use cases are huge. You can think about designers, architects, industrial designers, as well as artists, 3D artists, and game developers; from creation all the way to robotics and robot learning.
The utility of spatial intelligence models, or world models, is really, really big. And then there are many related industries, from marketing to entertainment to even the metaverse. I'm actually really, really excited by the metaverse. I know so many people are kind of
still like, it's still not working. I know it's still not working. That's why I'm excited, because I think the convergence of hardware and software is coming. So that's another great use case down the road. I'm personally very excited that you're working on the metaverse; I gave it a try in my previous company, so I'm so excited that you're doing that now. Yeah, well, I think there's more signal now. I do think hardware is part of the hurdle, but you also need content creation, and in the metaverse, content creation needs world models. Let's switch gears a little bit. Some of the audience might find your transition from academia to founder and CEO sudden.
But you've actually had a remarkable journey through your whole life; this is not the first time you've gone from zero to one. You were telling me about how you immigrated to the US without speaking any English in your teens, and you even ran a laundromat for a good number of years. Tell us about how all of that shaped who you are now. Right. I'm sure you guys are here trying to learn how to start a laundromat.
That was when you were 19, right? Yeah, I was 19, and it was out of desperation. I had no means of supporting my family, my parents, and I needed to go to college to be a physics major at Princeton. So I started a dry cleaning shop, and in Silicon Valley language, I fundraised, I was the founder and CEO, I was also the cashier and everything else, and I exited after seven years. You guys are very kind. I've never gotten claps for my laundromat, but thank you.
Anyway, to Diana's point, and especially to all of you: I look at you and I'm so excited for you, because you're literally half my age, maybe even 30 percent of my age, and you're so talented. Just do it. Don't be afraid. Throughout my entire career, of course, I did the laundromat, but even as a professor, a couple of times I chose to go to departments where I was the first computer vision professor.
And that was against a lot of advice. You know, as a young professor, you should go to a place where there's a community and senior mentors. Of course, I would have loved to have senior mentors, but if they're not there, I still have to blaze my own trail, right? So I wasn't afraid of that. And then I did go to Google, where I learned a lot about business, Google Cloud, B2B, and all of that.
And then I started a startup within Stanford, because around 2018, AI was not only taking over the industry, AI had become a human problem. Humanity will always advance our technology, but we cannot lose our humanity. And I really cared about creating a beacon of light in the progress of AI, trying to imagine how AI can be human-centered, how we can create AI to help humanity. So I went back to Stanford, created the Human-Centered AI Institute, and ran it like a startup for five years. Probably some people were not too happy that I ran it like a startup inside a university, but I was very proud of that. So in a way,
I think I just love being an entrepreneur. I love the feeling of standing at ground zero. Forget about what you have done in the past. Forget about what others think of you. Just hunker down and build. That is my comfort zone, and I just love that. The other really cool thing about you, on top of all the awesome things you've done: you advised a lot of legendary researchers, like Andrej Karpathy, Jim Fan, who's at NVIDIA, and Jia Deng, your co-author on ImageNet. They all went on to have these incredible careers. What really stood out about them when they were students? Any advice for the audience on how you could tell this person was going to change the field of AI? So first of all, I'm the lucky one.
I think I owe more to my students than the other way around. They really make me a better person, better teacher, better researcher. And having worked with so many, like you said, legendary students is really the honor of my life.
So they're very, very different. Some of them are pure scientists trying to hunker down and solve a scientific problem. Some of them are industry leaders. Some of them are the greatest disseminators of AI knowledge. But I think there is one thing that unifies them, and I would encourage every single one of you to think about this; for those founders who are hiring, it's also my hiring criterion: I look for intellectual fearlessness. It doesn't matter where you come from, it doesn't matter what problem you're trying to solve; that courage, that fearlessness of embracing something hard, going about it all in, and trying to solve it in whatever way you can, is really a core characteristic of people who succeed. I learned this from them, I really look for young people who have that, and as CEO of World Labs, I look for that quality in my hiring.
So you're hiring a lot for World Labs too, and you're looking for that same trait, right? Yes, I got permission from Diana to say that we're hiring. So yes, we are hiring a lot. We're hiring engineering talent, product talent, 3D talent, and generative-model talent. So if you feel you're fearless and you're passionate about solving spatial intelligence, talk to me or come to our website. Cool. We're going to open it up for questions for the
next 10 minutes. Hi, Fei-Fei. Thank you for your talk. I'm a big, big, big fan. My question is: more than two decades ago, you worked on visual recognition. I want to start my PhD. What should I work on so I become a legend like you? I want to give you a thoughtful answer, because I could always just say, do whatever excites you. So first of all, I think AI research has changed. If you're starting a PhD, you're in academia, and academia no longer has most of the AI resources. It's very different from my time, right? Academia is really low on compute and data resources, and there are problems where industry can just move a lot faster.
So as a PhD student, I would recommend you look for those North Stars that are not on a collision course with problems that industry can solve better using better compute, better data, and team science.
But there are some really fundamental problems we can still identify in academia where it doesn't matter how many chips you have, you can make a lot of progress. First of all, interdisciplinary AI is, to me, a really, really exciting area in academia, especially for scientific discovery. There are just so many disciplines that can cross with AI; I think that's a big area one could go into. On the theoretical side, I find it fascinating that AI capability has completely outrun theory. We don't have explainability, we don't know how to figure out causality; there's just so much in these models we don't understand that one could push forward. And the list could go on. In computer vision, there are still representational problems we haven't solved. And small data, that's another really interesting domain. So yeah, these are some of the possibilities.
Thank you so much, Fei-Fei. Thank you, Professor Li, and congratulations again on your honorary doctorate from Yale; I was honored to witness that moment a month ago. My question is: in your view, is AGI more likely to emerge as a single unified model or as a multi-agent system?
The way you ask this question already contains two kinds of definitions. One definition is more theoretical: define AGI as if there were an IQ test that, once passed, defines AGI. The other half of your question is much more utilitarian: is it functional, is it agent-based, what tasks can it do? I struggle with this definition of AGI, to be honest. Here's why. The founding fathers of AI, who came together in 1956 at Dartmouth, the John McCarthys and Marvin Minskys of the field, wanted to solve the problem of machines that can think. And that's a problem Alan Turing also put forward a few years, ten years or whatever, earlier than them.
And that statement is not about narrow AI; it's a statement about intelligence. So I don't really know how to differentiate that founding question of AI from this new word AGI. To me, they're the same thing. But I get that the industry today likes to say AGI as if it's something beyond AI. And I struggle with that, because I don't know exactly how AGI is different from AI. If we say today's AGI-ish systems perform better than the narrower AI systems of the '70s, '80s, and '90s, I think that's right; that's just the progression of the field. But fundamentally, I think the science of AI is the science of intelligence: to create machines that can think and do things as intelligently as humans, or even more intelligently.
So I don't know how to define AGI, and without defining it, I don't know if it's monolithic. If you look at the brain, it's one thing; you can call it monolithic, but it does have different functionalities. There's Broca's area for language, there's the visual cortex, there's the motor cortex. So I don't really know how to answer that question.
Hi, my name is Yashna, and I just want to say thank you; I think it's really inspiring to see a woman playing a leading role in this field. As a researcher, educator, and entrepreneur, what type of person do you think should pursue graduate school in this rapid rise of AI? That's a great question; it's a question even parents ask me. Graduate school is the four or five years where you have a burning curiosity. You're led by curiosity, and that curiosity is so strong that there's no better place to pursue it.
It's different from a startup. You have to be a little careful: a startup cannot be led just by curiosity; your investors would be mad at you. A startup has a more focused commercial goal, and some part of it is curiosity, but it's not just curiosity. Whereas for grad school, that curiosity to solve problems or to ask the right questions
is so important that I think those going in with that intense curiosity would really enjoy the four or five years, even if the outside world is passing by at the speed of light. You'll still be happy because you're there following that curiosity.
First, I want to say thank you for your time and for coming out to speak to us. You mentioned that open sourcing was a big part of ImageNet's growth. Now, with the recent release and growth of large language models, we've seen organizations take different approaches to open source: some staying fully closed source, some fully releasing their entire research stack, and some landing somewhere in the middle, open-sourcing weights or using restrictive licenses and things of that nature. So I wanted to ask, what do you think of these different approaches, and what do you believe is the right way for an AI company to go about open source?
I think the ecosystem is healthy when there are different approaches. I'm not religious in terms of you must open source or you must close source. It depends on the company's business strategy. And for example, it's clear why Meta wants to open source, right? They...
Right now, their business model is not selling the model yet. They're using it to grow the ecosystem so that people come to their platform. So open source makes a lot of sense, whereas another company that is really monetizing on the... Even monetizing, you can think about an open source tier and a closed source tier. So I'm pretty open to that.
At a meta level, I think open source should be protected. Efforts at open source, both in the public sector, like academia, and in the private sector, are so important. They're so important for the entrepreneurial ecosystem, and so important for the public sector, that I think they should be protected, not penalized.
Hi, my name is Karl; I flew in from Estonia. I have a question about data. With ImageNet, you called the shift in machine learning towards data-driven methods very well. Now you're working on world models, and you mentioned that we don't have this spatial data on the internet; it exists only in our heads. How are you solving this problem? What are you betting on? Are you collecting data from the real world, do you believe in synthetic data, or in good old priors? Thanks.
You should join World Labs and I'll tell you. Oh, that's a good one. Look, as a company, I'm not going to be able to share a lot, but I think it's important to acknowledge that we're taking a hybrid approach. It is really important to have a lot of data, but also to have high-quality data; at the end of the day, it's still garbage in, garbage out if you're not careful with data quality. We'll do one last question. Hi, Dr. Li. My name is Annie, and thank you very much for speaking with us.
So in your book, The Worlds I See, you talk about the challenges you faced as an immigrant girl and a woman in STEM. I'm curious whether there was a time you felt you were a minority in the workplace, and if so, how you managed to overcome it or persuade others. Thank you for that question. I want to be very, very careful and thoughtful in answering you, because we all come from different backgrounds and how each of us feels is very unique. It almost doesn't even matter what the big categories are. All of us have moments when we feel we are the minority, or the only person in the room. So of course I've felt that way. Sometimes it's based on who I am, sometimes on my ideas, sometimes just on, I don't know, the color of my shirt, whatever it is. But this is where I do want to encourage everybody. Maybe it's because I came to this country when I was young, but I've kind of learned: it is what it is. I am an immigrant woman. I've almost developed a capability to not over-index on that. I'm here just like every one of you, to learn, to do things, to create things.
Thank you. That was a great answer. And really, all of you: you're about to embark on something, or you're in the middle of embarking on something, and you're going to have moments of weakness or strangeness. I feel this every day, especially in startup life. Sometimes I'm like, oh my God, I don't know what I'm doing. Just focus on doing it. Gradient descent yourself to the optimized solution.
All right. That's a great way to end it. Thank you, Dr. Li.