Hi, listeners. Welcome to No Priors. This week, we're speaking to Chelsea Finn, co-founder of Physical Intelligence, a company bringing general purpose AI into the physical world.
Chelsea co-founded Physical Intelligence alongside a team of leading researchers and minds in the field. She's an associate professor of computer science and electrical engineering at Stanford University. And prior to that, she worked at Google Brain and was at Berkeley. Chelsea's research focuses on how AI systems can acquire general purpose skills through interactions with the world. So Chelsea, thank you so much for joining us today on No Priors. Yeah, thanks for having me. You've done a lot of really important storied work.
in robotics between your work at Google, at Stanford, etc. So I would just love to hear a little bit firsthand about your background, your path in the world of robotics, what drew you to it initially, and some of the work that you've done. Yeah, it's been a long road. At the beginning, I was really excited about the impact that robotics could have in the world. But at the same time, I was also really fascinated by this problem of developing perception and intelligence in machines.
And robots embody all of that. And sometimes there's also some cool math that you can do as well that keeps your brain active, makes you think.
And so I think all of that is really fun about working in the field. I started working more seriously in robotics more than 10 years ago at this point, at the start of my PhD at Berkeley. And we were working on neural network control, trying to train neural networks that map from image pixels directly to motor torques
on a robot arm. At the time, this was not very popular, but we've come a long way; it's a lot more accepted in robotics now and also just generally something that a lot of people are excited about. Since that beginning point, it was very clear to me that we could train robots to do pretty cool things, but that getting the robot to do
one of those things in many scenarios with many objects was a major, major challenge. So 10 years ago, we were training robots to screw a cap onto a bottle, use a spatula to lift an object into a bowl, do a tight insertion, or hang a hanger on a clothes rack.
And so pretty cool stuff. But actually getting the robot to do that in many environments with many objects, that's where a big part of the challenge comes in. And I've been thinking about ways to make broader data sets, train on those broader data sets, and also different approaches for learning, whether it be reinforcement learning, video prediction, imitation learning, all those things.
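To make that pixels-to-torques idea concrete, here is a minimal sketch of what such a policy could look like. It is purely illustrative, not the architecture from that work; the layer sizes, camera resolution, and seven-joint arm are assumptions.

```python
# Minimal sketch of a "pixels to motor torques" policy, as described above.
# Illustrative only, not the actual architecture from that work; layer sizes,
# the 64x64 camera resolution, and the 7-joint arm are assumptions.
import torch
import torch.nn as nn

class PixelsToTorquesPolicy(nn.Module):
    def __init__(self, num_joints: int = 7):
        super().__init__()
        # Small conv net that turns the camera image into a feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # MLP head mapping image features plus joint angles to torque commands.
        self.head = nn.Sequential(
            nn.Linear(64 + num_joints, 128), nn.ReLU(),
            nn.Linear(128, num_joints),
        )

    def forward(self, image: torch.Tensor, joint_angles: torch.Tensor) -> torch.Tensor:
        features = self.encoder(image)                  # (batch, 64)
        x = torch.cat([features, joint_angles], dim=-1)
        return self.head(x)                             # (batch, num_joints) torques

# Example: one 64x64 RGB frame plus current joint angles -> a torque command.
policy = PixelsToTorquesPolicy()
torques = policy(torch.randn(1, 3, 64, 64), torch.randn(1, 7))
```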
And so, yeah, I spent a year at Google Brain in between my PhD and joining Stanford, became a professor at Stanford, started a lab there, did a lot of work along all these lines, and then recently started Physical Intelligence almost a year ago at this point. So I've been on leave from Stanford for that. And it's been really exciting to be able to
try to execute on the vision that we co-founders collectively have and do it with a lot of resources and so forth. And I'm also still advising students at Stanford as well. That's really cool. And I guess you started Physical Intelligence with four other co-founders and an incredibly impressive team. Could you tell us a little bit more about what Physical Intelligence is working on and the approach that you're taking? Because I think it's a pretty unique slant on the whole field and approach. Yeah. So we're trying to build...
a big neural network model that could ultimately control any robot to do anything in any scenario. And a big part of our vision is that,
in the past, robotics has focused on trying to go deep on one application and developing a robot to do one thing, and then ultimately gotten kind of stuck in that one application. It's really hard to solve one thing and then try to get out of that and broaden. And instead, we're really in it for the long term to try to address this broader problem of physical intelligence in the real world. We're thinking a lot about generalization, generalists, and...
Unlike other robotics companies, we think that...
being able to leverage all of the possible data is very important. And this comes down to actually not just leveraging data from one robot, but from any robot platform that might have six joints or seven joints or two arms or one arm. We've seen a lot of evidence that you can actually transfer a lot of rich information across these different embodiments, and that allows you to use more data. And also, if you iterate on your robot platform, you don't have to throw all your data away. I've faced a lot of pain in the past where we got a new version of the robot and then the policy doesn't work. And
It's a really painful process to try to get back to where you were on the previous robot iteration. So, yeah, trying to build generalist robots and essentially kind of develop foundation models that will power the next generation of robots in the real world. That's really cool because, I mean, I guess there's a lot of sort of parallels here.
to the large language model world where, you know, really a mixture of deep learning, the transformer architecture and scale has really proven out that you can get real generalizability and different forms of transfer between different areas. Could you tell us a little bit more about the architecture you're taking or the approach or, you know, how you're thinking about the basis for the foundation model that you're developing? At the beginning, we were just getting off the ground. We were trying to scale data collection. And a big part of that is,
unlike in language, we don't have Wikipedia or an internet of robot motions. And we're really excited about scaling data on real robots in the real world. This kind of real data is what has fueled machine learning advances in the past. And a big part of that is we actually need to collect that data. And that looks like teleoperating robots in the physical world. We're also exploring other ways of scaling data as well. But the kind of bread and butter is scaling real robot data.
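For a sense of what teleoperated robot data can amount to in practice, here is a rough, hypothetical sketch of a single logged timestep. The field names, shapes, and annotations are illustrative assumptions, not Physical Intelligence's actual data format.

```python
# Hypothetical sketch of one logged timestep of teleoperated robot data.
# Field names and shapes are illustrative assumptions, not an actual schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class TeleopStep:
    base_image: np.ndarray       # (H, W, 3) RGB frame from a scene-level camera
    wrist_images: list           # list of (H, W, 3) frames, one per wrist camera
    joint_positions: np.ndarray  # the robot's current joint angles
    action: np.ndarray           # the operator's commanded joint targets
    task_label: str              # natural-language annotation, e.g. "fold the shirt"

# An episode is a time-ordered list of such steps, later replayed as
# (observation, action) pairs for imitation learning.
episode = [
    TeleopStep(
        base_image=np.zeros((224, 224, 3), dtype=np.uint8),
        wrist_images=[np.zeros((224, 224, 3), dtype=np.uint8)],
        joint_positions=np.zeros(7),
        action=np.zeros(7),
        task_label="fold the shirt",
    )
]
```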
We released something in late October where we showed some of our initial efforts around scaling data and how we can learn very complex tasks of folding laundry, cleaning tables, constructing a cardboard box. Now, where we are in that journey is really thinking a lot about language interaction and generalization to different environments. So
What we showed in October was the robot in one environment, and it was trained on data in that environment. We were able to see some amount of generalization. So it was able to fold shirts that it had never seen before, fold shorts that it had never seen before, but, um,
the degree of generalization was very limited and you also couldn't interact with it in any way. You couldn't prompt it and tell it what you want it to do beyond fairly basic things that it saw in the training data. Being able to handle lots of different prompts in lots of different environments is a big focus right now. In terms of the architecture, we're using transformers and we are using pre-trained models, pre-trained vision language models,
And that allows you to leverage all of the rich information on the internet. We had a research result a couple of years ago where we showed that if you leverage vision language models, then you could actually get the robot to do tasks that require concepts that were never in the robot's training data, but were on the internet. One famous example is that you can ask it to pass the Coke can to Taylor Swift, or a picture of Taylor Swift, and the robot has never seen Taylor Swift in person, but the internet has lots of images of Taylor Swift in it. And you can leverage all of the information in that data and then the weights of the pre-trained model
to kind of transfer that to the robot. So we're not starting from scratch, and that helps a lot as well. So that's a little bit about the approach, happy to dive deeper as well. That's really amazing. And then what do you think is the main basis then for really getting to generalizability? Is it scaling data further? Is it scaling compute? Is it a combination of the two? Is it other forms of post-training or something? Like, I'm just sort of curious, like as you think through the common pieces that people look at now,
I'm sort of curious what you think needs to get filled in. Obviously, again, in the more language-model world, people are spending a lot of time on reasoning modules and other things like that as well. So I'm curious, like, what are the components that you feel are missing right now? Yeah, so I think the number one thing, and this is kind of the boring thing, is just getting more diverse robot data. So for that release that we had in late October last year, we collected data in...
three buildings, technically. The internet, for example, and everything that has fueled language models and vision models, is way, way more diverse than that, because the internet is pictures taken by lots of people and text written by lots of different people.
And so just trying to collect data in many more diverse places and with many more objects, many more tasks. So scaling the diversity of the data, not just the quantity of the data, is very important. And that's a big thing that we're focusing on right now, actually bringing our robots into lots of different places and collecting data in them. As a side product of that, we also learn what it takes to actually
get your robot to be operational and functional in lots of different places. And that is a really nice byproduct because if you actually want to get robots to work in the real world, you need to be able to do that. So that's the number one thing. But then we're also exploring other things, leveraging videos of people, again, leveraging data from the web, leveraging pre-trained models, thinking about
reasoning, although more basic forms of reasoning in order to, for example, put a dirty shirt into a hamper. If you can recognize where the shirt is and where the hamper is and what you need to do to accomplish that task, that's useful. Or if you want to make a sandwich and the user has a particular request in mind, you should reason through that request. If they're allergic to pickles, you probably shouldn't put pickles on the sandwich.
things like that. So there's some basic things around there, although the number one thing is just more diverse robot data. And then I think a lot of the approach you've taken to date has really been an emphasis on releasing open source models and packages for robotics. Do you think that's the long-term path? Do you think it's open core? Do you think it's eventually proprietary models? Or how do you think about that and
in the context of the industry because it feels like there's a few different robotics companies now each taking different approaches in terms of either hardware only, I mean, excuse me, hardware plus software and they're focused on a specific hardware footprint. There's,
software, and there's closed source versus open source if you're just doing the software. So I'm sort of curious where in that spectrum Physical Intelligence lies. Definitely. So we've actually been quite open. Not only have we open sourced some of the weights and released details and technical papers, we've actually also been working with hardware companies and giving designs of robots to hardware companies. And when I tell people this, sometimes they're actually really shocked: what about the IP? What about, I don't know, confidentiality and stuff like that? And we've
actually made a very intentional choice around this. There's a couple reasons for it. One is that we think the field is really just at the beginning, and these models will be so, so much better and the robots should be so, so much better in a year, in three years. And we want to support the development
of the research and we want to support the community, support the robots so that when we hopefully develop the technology of these generalist models, the world will be more ready for it. We'll have better, like more robust robots that are able to leverage those models, people who have the expertise and understand what it requires to use those models. And then the other thing is also like, we have a really fantastic team of researchers and engineers and
Really, really fantastic researchers and engineers want to work at companies that are open, especially researchers where they can get kind of credit for their work and share their ideas, talk about their ideas. And we think that having the best researchers and engineers will be necessary for solving this problem. The last thing that I'll mention is that I think the biggest risk with this bet is that it won't work. Like, I'm not really worried about competitors. I'm more worried that
no one will solve the problem. Oh, interesting. And why do you worry about that? I think robotics is, it's very hard. And there have been many, many failures in the past. And unlike when you're recognizing an object in an image, there's very little tolerance for error. You can miss a grasp on an object. The difference between making contact and not making contact with an object
is so small, and it has a massive impact on the outcome of whether the robot can actually successfully manipulate the object. And I mean, that's just one example. There are challenges on the data side of collecting data. Well, just anything involving hardware is hard as well. I guess we have a number of examples now of robots in the physical world
You know, everything from autopilot on a jet on through to some forms of pick and pack or other types of robots in distribution centers. And there's obviously the different robots involved with manufacturing, particularly in automotive. Right. So there's been a handful of more constrained environments where
people have been using them in different ways. Where do you think the impact of these models will first show up? Because to your point, there are certain things where you have very low tolerance for error. And then there's a lot of fields where actually it's okay, or maybe you can constrain the problem sufficiently relative to the capabilities of the model that it works fine. Where do you think physical intelligence will have the nearest term impact or in general, the field of robotics and these new approaches?
will substantiate themselves. Yeah, as a company, we're really focused on the long-term problem and not on any one particular application, because of the failure modes that can come up when you focus on one application. I don't know where the first applications will be. I think one thing that's actually challenging is that,
typically in machine learning, for a lot of the successful applications, like recommender systems, language models, image detection, a lot of the consumers of the model outputs are actually humans who can check them. And the humans are good at the thing. A lot of the very natural applications of robots are actually the robot doing something autonomously on its own, where it's not a human consuming the
commanded arm position, for example, and then checking it and then validating it and so forth. And so I think we need to think about new ways of having some kind of tolerance for mistakes or scenarios where that's fine or scenarios where humans and robots can work together. That's, I think, one big challenge that will come up when trying to actually deploy these. And some of the language interaction work that we've been doing is actually
motivated by this challenge where we think it's really important for humans to be able to kind of provide input for how they want the robot to behave and what they want the robot to do, how they want the robot to help in a particular scenario. That makes sense. I guess the other form of generalizability to some extent, at least in our
current world is the human form, right? And so some people are specifically focused on humanoid robots like Tesla and others under the assumption that the world is designed for people and therefore is the perfect form factor to coexist with people. And then other people have taken very different approaches in terms of saying, well, I need something that's more specialized for the home in certain ways or for
factories or manufacturing or you name it. What is your view on humanoid versus not? On one hand, I think humanoids are really cool, and I have one in my lab at Stanford. On the other hand, I think that they're a little overrated. And one way to practically look at it is I think that we're generally fairly bottlenecked on data right now. And some people argue that with humanoids,
you can maybe collect data more easily because it matches the human form factor. And so maybe it'd be easier to mimic humans. And I've actually heard people make those arguments, but if you've ever actually tried to teleoperate a humanoid, it's actually a lot harder to teleoperate than a static manipulator or a mobile manipulator with wheels. Optimizing for being able to collect data, I think is very important because if we can get to the point where we have
more data than we could ever want, then it just comes down to research and compute and evaluations.
That's one of the things we're optimizing for. And so we're using cheap robots. We're using robots that we can very easily develop teleoperation interfaces for, in which you can do teleoperation very quickly and collect diverse data, collect lots of data. Yeah, it's funny. There was that viral fake Kim Kardashian video of her going shopping with a robot following her around, carrying all of her shopping bags. When I saw that, I really wanted a humanoid robot to follow me around everywhere.
And that'd be really funny. So I'm hopeful that someday I can use your software to get a robot to follow me around to do things. So, exciting future. How do you think about the embodied model of development versus not on some of these things? That's another sort of, I think, set of trade-offs that some people are making or deciding between. Well, the AI community is very focused on development,
just like language models, vision language models and so forth. And there's a ton of hype around reasoning and stuff like that: oh, let's create the most intelligent thing. I feel like people actually underestimate how much intelligence goes into motor control. Many, many years of evolution is what led to us being able to use our hands the way that we do. And there are many animals that can't do it, even though they also had many years of evolution. And so I think that there's actually so much
complexity and intelligence that goes into being able to do something as basic as making a bowl of cereal or pouring a glass of water. And yeah, so in some ways I think that embodied intelligence or physical intelligence is very core to intelligence and maybe kind of underrated compared to some of the less embodied models. One of the papers that I really loved over the last couple of years in robotics was your ALOHA paper.
And I thought it was a very clever approach. What is some of the research over the last two or three years that you think has really caused this flurry of activity? Because I feel like there's been a number of people now starting companies in this area because a lot of people feel like now's the time to do it.
And I'm a little bit curious what research you feel was the basis for that shift and people thinking this was a good place to work. At least for us, there were a few things that felt like turning points, where it felt like the field was moving a lot faster compared to where it was before. One was...
The SayCan work, where we found that you can plan with language models as kind of the high-level part, and then plug that in with a low-level model to get a model to do long-horizon tasks. Another was the RT-2 work, which showed that you could do the Taylor Swift example that I mentioned earlier and be able to plug in a lot of the web data and get better generalization on robots. A third was our RT-X work, where...
We actually were able to train models across robot embodiments. And, significantly, we basically took all the robot data that different research labs had. It was a huge effort to aggregate that into a common format and train on it. And when we trained on that, we actually found that we could take a checkpoint, send that model checkpoint to another lab halfway across the country, and the grad student at that lab could run the checkpoint on the robot and it would actually...
more often than not do better than the model that they had specifically iterated on themselves in their own lab. And that was another big sign that this stuff is actually starting to work and that you can get benefit by pooling data across different robots. And then also, like you mentioned, I think the ALOHA work and later the Mobile ALOHA work showed that you can teleoperate robots and train models to do pretty complicated dexterous manipulation tasks. We also had a follow-up paper with the shoelace tying that was
a fun project because someone said that they would retire if they saw a robot tie shoelaces. So did they retire? They did not retire. We need to force them into retirement. Whoever that person is, we need to follow up on that. Yeah. So those are a few examples. And so, yeah, I think we've seen a ton of progress in the field. It also seems like, after we started Pi, that was also kind of a sign to others that if the experts are really willing to bet on this, then, um,
something, maybe something will happen. So one thing that you all came out with today from Pi was what you call a hierarchical interactive robot, or Hi Robot. Can you tell us a little bit more about that? So this was a really fun project. There's two things that we're trying to look at here. One is that if you need to do a longer-horizon task, meaning a task that might take minutes to do, then
if you just train a single policy to output actions based on images, like if you're trying to make a sandwich and you train a policy that's just
outputting the next motor command, that might not do as well as something that's actually kind of thinking through the steps to accomplish that task. That was kind of the first component. That's where the hierarchy comes in. And the second component is a lot of the times when we train robot policies, we're just saying like, we'll take our data, we'll annotate it and say like, this is picking up the sponge. This is putting the bowl in the bin. This segment is, I don't know, folding the shirt. And then you get a policy that can like follow those basic commands of like fold the shirt or...
pick up the cup, those sorts of things. But at the end of the day, we don't want robots just to be able to do that. We want them to be able to interact with us where we can say like, "Oh, I'm a vegetarian. Can you make me a sandwich? Oh, and I'm allergic to pickles. So maybe don't include those."
And maybe also be able to interject in the middle and say like, "Oh, hold off on the tomatoes or something." There's actually kind of a big gap between something that can just follow an instruction like pick up the cup and something that can handle those kinds of prompts and those situated corrections and so forth. And so we developed a system that basically has one model that takes as input the prompt, kind of reasons through it, and is able to output the next step
that the robot should follow. That might be, for example, pick up the tomato. And then a lower-level model takes as input "pick up the tomato" and outputs the sequence of motor commands for the next half second or so.
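Here is a rough sketch of that two-level structure, assuming stand-in functions for the two models: a high-level vision-language model that turns the user's open-ended prompt into the next language subtask, and a low-level policy that turns that subtask into a short chunk of motor commands. The function names, chunk length, and control rate are illustrative assumptions, not the actual Hi Robot implementation.

```python
# Rough sketch of the two-level structure described above: a high-level model
# picks the next language subtask from the user's prompt, and a low-level
# policy turns that subtask into a short chunk of motor commands. Function
# names, chunk length, and control rate are illustrative assumptions.
import numpy as np

def high_level_step(prompt: str, image: np.ndarray) -> str:
    """Stand-in for a vision-language model that reasons over the user's prompt
    and the current camera image and emits the next subtask in plain language."""
    # e.g. "I'm a vegetarian and allergic to pickles" + current scene -> next step
    return "pick up the tomato"

def low_level_policy(subtask: str, image: np.ndarray, joints: np.ndarray) -> np.ndarray:
    """Stand-in for a learned policy that maps a language subtask plus current
    observations to motor commands covering roughly the next half second."""
    num_joints, steps_per_chunk = 7, 5   # e.g. ~0.5 s of commands at 10 Hz
    return np.zeros((steps_per_chunk, num_joints))

def run_episode(prompt: str, get_observation, send_commands, num_chunks: int = 100):
    """Alternate between the two levels: re-plan a subtask, execute a chunk, repeat."""
    for _ in range(num_chunks):
        image, joints = get_observation()
        subtask = high_level_step(prompt, image)           # e.g. "pick up the tomato"
        actions = low_level_policy(subtask, image, joints)
        send_commands(actions)                             # stream to the robot
```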
That's the gist of it. It was a lot of fun because we actually got the robot to make a vegetarian sandwich or a ham and cheese sandwich or whatever. We also did a grocery shopping example and a table cleaning example. And I was excited about it, first because it was just cool to see the robot be able to respond to different problems and do these challenging tasks, and second, because it actually seems like the right approach for solving the problem. On the technical capabilities side, one thing I was wondering about a little bit was
If I look at the world of self-driving, there's a few different approaches that are being taken. And one of the approaches that is the more kind of Waymo-centric one is really incorporating a variety of other types of sensors besides just vision. So you have LiDAR and a few other things as ways to augment the self-driving capabilities of a vehicle. Where do you think we are in terms of the sensors that we use in the context of robots? Is there anything missing? Is there anything we should add or there are
types of inputs or feedback that we need to incorporate that haven't been incorporated yet? So we've gotten very far just with vision, with RGB images even. And we typically will have one or multiple external kind of what we call base cameras that are looking at the scene and also cameras mounted to each of the wrists of the robot. We can get very, very far with that.
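As a small sketch of how that camera setup might feed a policy, here is one way to fuse a scene-level base camera and wrist cameras with a shared image encoder. The encoder choice, resolutions, and feature sizes are illustrative assumptions, not Physical Intelligence's actual model.

```python
# Hypothetical sketch of fusing the camera setup described above (scene-level
# "base" cameras plus wrist cameras) into one policy input. The encoder and
# feature sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MultiCameraEncoder(nn.Module):
    def __init__(self, num_cameras: int = 3, feature_dim: int = 64):
        super().__init__()
        # One small conv encoder shared across every RGB view.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, feature_dim, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.output_dim = num_cameras * feature_dim

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, num_cameras, 3, H, W) -> one concatenated feature vector.
        b, n, c, h, w = images.shape
        features = self.backbone(images.reshape(b * n, c, h, w))
        return features.reshape(b, -1)

# Example: one base camera plus two wrist cameras at 96x96 resolution.
encoder = MultiCameraEncoder(num_cameras=3)
features = encoder(torch.randn(1, 3, 3, 96, 96))  # (1, 192), fed to the policy head
```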
I would love it if we could give our robots skin. Unfortunately, a lot of the tactile sensors that are out there are either far less robust than skin, far more expensive, or very, very low resolution. And so there's a lot of challenges on the hardware side there. And we found that actually mounting RGB cameras to the wrists
ends up being very, very helpful and probably giving you a lot of the same information that tactile sensors can give you. Because when I think about the set of sensors that are incorporated into a person, obviously to your point, there's the tactile sensors effectively, right? And then there's heat sensors. There's actually a variety of things that are incorporated that people usually don't really think about much. Absolutely. And I'm just sort of curious, like how many of those are actually necessary in the context of robotics versus not? What are some of the things we should think about? Like just if we extrapolate off of
humans or animals or other, you know. It's a great question. I mean, for the sandwich making, you could argue that you'd want the robot to be able to taste the sandwich to know if it's good or not. Or smell it at least, you know. Yeah. I've made a lot of arguments for smell to Sergey in the past because there are a lot of nice things about smell, although we've never actually attempted it before. Yeah. In some ways, the redundancy is nice.
And I think audio, for example: like a human, if you hear something that's unexpected, it can actually kind of alert you to something. In many cases, it might actually be very, very redundant with your other sensors, because you might be able to actually see something fall, for example. And that redundancy can lead to robustness. For us, it's
currently not a priority to look into these sensors, because we think that the bottleneck right now is elsewhere; it's on the data front, it's on kind of the architectures and so forth. The other thing I'll mention is that right now our policies do not have any memory. They only look at the current image frame. They can't remember even half a second prior. And so I would much rather add memory to our models before we add other sensors. We can have
commercially viable robots for a number of applications without other sensors. What do you think is the timeframe on that? I have no idea. Yeah. There are some parts of robotics that make it easier than self-driving and some parts that make it harder. On one hand,
it's harder because it's just a much higher dimensional space. Even our static robots have 14 dimensions, seven for each arm. You need to be more precise in many scenarios than driving. We also don't have as much data right off the bat. On the other hand, with driving, I feel like you kind of need to solve the
entire distribution to have anything that's viable. You have to be able to handle an intersection at any time of day, or with any kind of possible pedestrian scenario, or other cars and all that. Whereas in robotics, I think that there's lots of
commercial use cases where you don't have to handle this whole huge distribution. And you also don't have as much of a safety risk as well. That makes me optimistic. And I think that also like all the results in self-driving have been very encouraging, especially like the number of Waymos that I see in San Francisco. Yeah, it's been very impressive to watch them scale up usage. The thing I found striking about the self-driving world is
You know, there were two dozen startups started roughly, I don't know, 10 to 15 years ago around self-driving.
And the industry has largely consolidated, at least in the U.S. And obviously the China market's a bit different, but it's consolidated into Waymo and Tesla, which effectively were two incumbents, right? Waymo came from Google, and Tesla was an automaker. And then there's maybe one or two startups that either SPACed and went public or are still kind of working in that area. And then most of it's kind of fallen off, right? And the set of players that existed at that starting moment 10, 15 years ago is kind of the same set of players that ended up actually winning, right? There hasn't been a lot of dynamism yet
in the industry other than just consolidation. Do you think that the main robotics players are the companies that exist today? And do you think there's any sort of incumbency bias that's likely? A year ago, it would have been completely different. And I think that we've had so many new players recently. I think that the fact that self-driving was like that suggests that it might have been a bit too early
10 years ago. And I think that arguably it was. I think deep learning has come a long, long way since then. And so I think that that's also part of it. And I think it's the same with robotics. If you were to ask me 10 years ago or even...
Even five years ago, honestly, I think it would be too early. I think the technology wasn't there yet. We might still be too early for all we know. I mean, it's a very hard problem. And how hard self-driving has been, I think, is a testament to how hard it is to build intelligence in the physical world. In terms of major players, there's a lot of things that I've really liked about the startup environment and a lot of things that were very hard to do when I was at Google. And Google is an amazing place in many, many ways. But as one example,
Taking a robot off campus was like almost a non-starter just for code security reasons. And if you want to collect diverse data, taking robots off campus is valuable. You can move a lot faster when you're a smaller company, when you don't have...
kind of restrictions, red tape, that sort of thing. The really big companies have a ton of capital and so they can last longer. But I also think they're going to move slower too. If you were to give advice to somebody thinking about starting a robotics company today, what would you suggest they do or where would you point them in terms of what to focus on? I think the main advice that I would give someone trying to start a company would be to...
try to learn as much as possible quickly. And I think that actually like trying to deploy quickly and learn and iterate quickly, that's probably...
the main advice and try to, yeah, like actually get the robots out there, learn from that. I'm also not sure if I'm the best person to be giving startup advice because I've only been an entrepreneur myself for 11 months. But yeah, that's probably the advice I'd give. That's cool. I mean, you're running an incredibly exciting startup, so.
I think you have full ability to suggest stuff to people in that area for sure. One thing I've heard a number of different groups are doing is really using observational data of people as part of the training set. So that could be YouTube videos. It could be things that they're recording specifically for the purpose. How do you think about that in the context of training robotic models? I think that data can have a lot of value, but I think that by itself, it won't get you very far. And I think that there's actually some really nice analogies you can make where, you know,
for example, if you watch an Olympic swimmer swim a race, even if you had their strength, their practice at moving their own muscles to accomplish what they're accomplishing is essential for being able to do it. Or if you're trying to learn how to hit a tennis ball well, you won't be able to learn it just by watching the pros.
Now, maybe these examples seem a little bit contrived because they're talking about like experts. The reason why I make those analogies is that we humans are experts at motor control, low-level motor control already for a variety of things that our robots are not. And I think the robots actually need experience from their own body in order to learn. And so I'm
I think that it's really promising to be able to leverage that form of data, especially to expand on the robot's own experience. But it's really going to be essential to actually have the data from the robot itself, too. In some of those cases, is that just general data that you're generating around that robot? Or would you actually have it mimic certain activities? Or how do you think about the data generation? Because you mentioned a little bit about the transfer and generalizability. It's interesting to ask, well, what is generalizable or not? And what types of data are and aren't and things like that?
I mean, when we collect data, it's kind of like puppeteering, like the original ALOHA work. And then you can record both the actual motor commands and the sensor readings, like the camera images. And so that is
the experience for the robot. And then I also think that autonomous experience will play a huge role, just like we've seen in language models after you get an initial language model. If you can use reinforcement learning to have the language model bootstrap on its own experience, that's extremely valuable. Yeah, and then in terms of what's generalizable versus not, I think it all comes down to the breadth of the distribution. It's really hard to quantify or measure how broad the robot's own experience is.
And there's no way to categorize the breadth of the tasks, like how different one task is from another, how different one kitchen is from another, that sort of thing. But we can at least get a rough idea for that breadth by looking at things like the number of buildings or the number of scenes, those sorts of things. And then I guess we talked a little bit about humanoid robots and other sort of formats. If you think ahead...
in terms of the form factors that are likely to exist in N years as this sort of robotic future comes into play? Do you think there's sort of one singular form or there are a handful? Is it a rich ecosystem, just like in biology? Like, how do you think about what's going to come out of all this? I don't know exactly, but I think that my bet would be on something where there's actually a...
a really wide range of different robot platforms. I think Sergey, my co-founder, likes to call it a Cambrian explosion of different robot hardware types and so forth, once we actually have the technology, the intelligence, that can power all those different robots. And I think it's kind of similar to how
we have all these different devices in our kitchen, for example, that can do all these different things for us, rather than just one device that cooks the whole meal for us. And so I think we can envision a world where there's one kind of robot arm that does things in the kitchen, that has some hardware that's optimized for that, and maybe also optimized for it to be cheap for that particular use case. And another...
hardware that's kind of designed for folding clothes or something like that, dishwashing, those sorts of things. This is all speculation, of course, but I think that a world like that is something where, yeah, it's different from what a lot of people think about. In the book, The Diamond Age, there's sort of this view of matter pipes going into homes and you have these 3D printers that make everything for you. And in one case, you're downloading schematics and then you 3D print the thing.
And then people who are kind of bootlegging some of this stuff end up with almost evolutionarily based processes to build hardware and then select against certain functionality as the mechanism by which to optimize things. Do you think a future like that is at all likely, or do you think it's more just, hey, you make the foundation model really good, you have a couple of form factors, and, you know, you don't need that much specialization if you have enough generalizability in the actual
underlying intelligence? I think a world like that is very possible. And I think that you can make a cheaper piece of hardware if you are optimizing for a particular use case, and maybe it'd also be a lot faster and so forth. Yeah, obviously very hard to predict. Yeah, it's super hard to predict because one of the arguments for a smaller number of hardware platforms is just supply chain, right? It's just going to be cheaper at scale, right?
to manufacture all the subcomponents and therefore you're going to collapse down to fewer things because unless there's a dramatic cost advantage those fewer things will be more
easily scalable, reproducible, cheap to make, et cetera, right? If you look at sort of general hardware approaches. So it's an interesting question in terms of that trade-off between those two tensions. Yeah, although maybe we'll have robots in the supply chain that can manufacture any customizable device that you want. It's robots all the way down. So that's our future. Yeah. Well, thanks so much for joining me today. It was a super interesting conversation. We covered a wide variety of things. So I really appreciate your time. Yeah, this was fun.
Find us on Twitter at NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.