- Vincent Vanhoucke is a distinguished engineer at Waymo, which he joined after founding and leading the robotics team at Google. I'm Jacob Efron, and today on Unsupervised Learning, we got Vincent's takes on both of these spaces. He talked about how LLMs have changed autonomous vehicles and robotics, how Waymo thinks about the amount of sensors they have on their cars, and what key milestones are left in the autonomous vehicle space. And then we also hit on AI and robotics, what the state of the space is today, and what we'll learn over the next few years that will determine the trajectory going forward.
These are topics I've wanted to hit for a long time on Unsupervised Learning and can't think of a better guest than Vincent to talk through them. I think you'll really enjoy the conversation. Without further ado, here's Vincent.
Well, thanks so much for coming on the podcast. Really appreciate it. Thanks for inviting me. I've been looking forward to this one for a while. I feel like a lot of different places we'll dive into today, but I figure we'll start with your current employer and Waymo. Waymo was working on self-driving well before this kind of like explosion in advances of LLMs, diffusion models, everything that's happened in the last like, you know, four or five years. And I'm wondering to what extent that's changed the way that Waymo's technology works. Well, the interesting thing is that to some degree, nothing has to be thrown out.
It's something that comes on top. The current sort of foundation model revolution
is about building teacher models: very large-scale models that can run in the cloud, that basically suck in all the data that we have available in addition to internet data, and build a very large model of the Waymo driver, the behavior of the car, and the environment.
You can take that teacher and use it to train and distill the data into the onboard models that you have on the car without necessarily having to do a lot of retrofitting. You can, as a starting point, not change anything and just use a different mode of supervision for those models.
Then you can also evolve on that basis and try to get models that are bigger, higher capacity, higher expressiveness, and things like this. But that's on top, right? So you don't throw away anything. You just give every model a better teacher and a lot more information to work from, basically. Wow.
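To make the teacher-student idea concrete, here is a minimal, hypothetical sketch of knowledge distillation in PyTorch: a small student model is trained to match the output distribution of a much larger teacher on the same inputs. The model sizes, the 16-way maneuver output, and the random placeholder data are illustrative assumptions, not Waymo's actual setup.

```python
import torch
import torch.nn as nn

# A large "teacher" (think: cloud-scale foundation model) and a small
# "student" sized for onboard compute. Both map sensor features to a
# distribution over driving maneuvers; the dimensions are illustrative.
teacher = nn.Sequential(nn.Linear(256, 2048), nn.ReLU(), nn.Linear(2048, 16)).eval()
student = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 16))

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
kl = nn.KLDivLoss(reduction="batchmean")

for _ in range(200):                       # one pass over (placeholder) logged data
    features = torch.randn(32, 256)        # stand-in for perception features
    with torch.no_grad():
        soft_targets = torch.softmax(teacher(features), dim=-1)
    log_probs = torch.log_softmax(student(features), dim=-1)
    loss = kl(log_probs, soft_targets)     # student matches the teacher's "soft" labels
    opt.zero_grad(); loss.backward(); opt.step()
```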
How has that actually been manifested? So you had this kind of way of doing things before and a previous set of models you worked with. How have LLMs and VLMs made their way into the Waymo stack? The big thing that LLMs and VLMs can bring to the table is
what we call world knowledge: the semantic understanding of the world around the car. The kind of things that are very obvious to you and me, like what a police car looks like, what an emergency vehicle looks like,
that the data that we collect from driving may not have experienced, right? Imagine we go into a new city, their police cars look a little different. Until we experience that, we wouldn't have that data. Those models know what a general police car looks like, what a general emergency vehicle looks like. And so that world knowledge...
And there are many examples of world knowledge. I'm just picking vehicles as an example, but also learning about what accident scenes look like, right? We don't see a lot of those in our data, but if you ask a large multimodal model like Gemini or GPT, you give it a picture of a scene with an accident, it will be able to recognize that this is what's happening, and that's the semantic context there
that's relevant to this. So bringing all that knowledge from essentially the web into the driver to be able to give it more capabilities is essentially what we're seeking out of this in addition to just scaling up.
So those models are very large. They're pre-trained on a lot of visual data. They're pre-trained on a lot of text data that enhances their reasoning capabilities. You can just leverage that scale, you know, just bigger is always better, essentially, and use that as a vehicle to build models that are just...
Conversely, what parts of the overall problem of self-driving are these models not actually that helpful for? You can distill every aspect of the driving problem into essentially a machine learning model. But there are things that you also want to build on top of it, on the outside of an AI model.
Typically, anything that has to do with strict contracts on safety, on regulatory constraints, and things like this, you want to be able to express those in a very explicit way, not in an indirect, implicit way, so that you can convince yourself that once the car drives and the AI model basically proposes a plan of driving, you can verify that this plan
meets your requirements in terms of safety and compliance, and is just generally well behaved.
Doing that from the outside gives you a very strong sort of framework for guaranteeing that the car behaves in a way that is reasonable at all times, while still enabling you to use the power of AI to come up with a plan, come up with a good idea in terms of what is the right strategy for driving.
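As a loose illustration of that kind of explicit checking layer (a hypothetical sketch, not Waymo's actual stack), imagine a model proposing a trajectory and a separate, hand-written verifier rejecting it if it violates hard constraints. Every constant and field name below is made up for the example.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryPoint:
    t: float       # seconds from now
    x: float       # meters, ego frame
    y: float
    speed: float   # m/s

# Illustrative hard constraints; real systems encode far richer rules.
SPEED_LIMIT = 15.0        # m/s for this road segment
MIN_CLEARANCE = 1.5       # meters to the nearest detected object
MAX_DECEL = 6.0           # m/s^2

def verify_plan(plan: list[TrajectoryPoint], clearance_fn) -> list[str]:
    """Return a list of violated constraints; empty means the plan passes."""
    violations = []
    for prev, cur in zip(plan, plan[1:]):
        if cur.speed > SPEED_LIMIT:
            violations.append(f"speed {cur.speed:.1f} m/s exceeds limit at t={cur.t:.1f}")
        dt = cur.t - prev.t
        if dt > 0 and (prev.speed - cur.speed) / dt > MAX_DECEL:
            violations.append(f"deceleration too harsh at t={cur.t:.1f}")
        if clearance_fn(cur.x, cur.y, cur.t) < MIN_CLEARANCE:
            violations.append(f"clearance below {MIN_CLEARANCE} m at t={cur.t:.1f}")
    return violations

# Usage idea: only execute the model's proposal if the explicit checker signs off,
# e.g. `if not verify_plan(proposed_plan, clearance_fn): execute(proposed_plan)`.
```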
Yeah, super interesting. It's basically putting that checking layer of guardrails around the output of the reasoning models. Makes a ton of sense. I guess, you know, you had a pretty storied run over at Google in the robotics world, kind of involved in, I think, a lot of the seminal work in the space. And then you chose to make the move over to Waymo. What drove that transition? And maybe talk a little bit about what got you so excited. Yeah, it was...
It was really accidental, pun intended, in the sense that I got into an accident last year. Nothing bad, but it put me out of commission for a few months. And during that time, I couldn't work, but I had to go to lots of physical therapy appointments.
And since I live in the city and I couldn't drive at the time, I basically took Waymos everywhere. And I really found the product to be amazing. It's a pretty magical experience. Yeah, it's completely magical. It was the first time that I had this rapport with essentially an AI system that I thought touched everyone. Yeah.
that was so easy to use, no fancy UI. It felt like the kind of AI that really applied to everybody. And that felt really exciting and magical to me. And then I...
I got back to work after a couple of months of being out. And I also have a fantastic team back at DeepMind who had basically picked up the reins while I was away and was doing just fine on their own. I took the opportunity to do something different.
knowing that the team was in good hands. I'm curious if you could talk about maybe the similarities and differences with the robotics work you were doing. I mean, at the highest level, it feels like a lot of these problems are perception, planning, and actuation. I'm sure in the details, there are lots of things that are bespoke to the self-driving world that may be different than some of the robotics work you did. Yeah, at the core, the problem is the same in the sense that an autonomous car is a robot, right? It has the same kind of inputs and outputs. You just
have sensors, cameras as the input, and you have actuation, so turning the wheel and pressing on the gas pedal, as the output. It's very similar to a manipulation robot that needs to see from its camera sensors and act with arms and hands and fingers. What's really different,
the big difference, is the operational domain. As long as I was working on robotics and AI, I was in more of a research environment where we were still chasing essentially the nominal behavior. We were still chasing:
how do we get a robot to even do the thing we want it to do? Like pick up an object or tie a shoelace or go make coffee. It's always go make coffee. I feel like everyone loves that demo for some reason. Go make coffee is great. It's amazing. It's a really good demo of where the capabilities are and where they're not quite yet.
And so just chasing that was really the name of the game. In the autonomous driving space, we know how to drive. We have a nominal system that works reasonably well and that is at the level of safety and performance and quality that gets us into commercial products already. All the challenges here are
about scaling. How would you kind of classify the state of autonomous vehicles today in terms of, you know, what we can do, what we can't do, and what the challenges on the horizon or things that still need to be solved are? There aren't that many big blockers. For example, we don't drive in snow right now, right? Mostly
due to lack of attention so far; it was not the most pressing thing to work on in the areas that we were in. But that is something that requires developing new capabilities. Most of the problems like this, like fog, or even driving on highways, where we've started driving and testing, have been knocked down over time.
And now the big challenges really are about essentially scaling. All the issues that have to do with what happens when you drive millions of miles.
The long tail of problems that you have to deal with at that scale kind of dominates the equation of what problems you have to solve, right? You can imagine if you as a driver, you experience a thing maybe once in your lifetime, right?
we will experience that thing pretty much every week or maybe every month. And so all of the things that are exceptional and weird and difficult are essentially becoming common occurrences for us and are kind of putting pressure on our scaling.
So solving for this long tail is really what we're focused on and what we're hoping that AI and sort of large model capabilities can help us accelerate. And how do you go about solving that? I mean, obviously, I imagine the hard part is that there's not a massive amount of data to go get. And I assume when you're doing freeways or snow or fog, you can gather actually a decent amount of data. But for some of these long tail problems, is it mostly using simulation? Yeah, we do a lot of simulation.
and a lot of synthesizing scenarios that correspond to problems that we knew could happen, that we may never have observed in the world, but we know it's an eventuality. And so we synthesize a lot of scenarios like this and validate our models against those. And then we try also to do a lot of looking at, you know,
a lot of the situations in which nothing really bad happened,
But there was a risk that something different could have happened, and so we just modify scenarios to make them worse. You can imagine having experienced a thing in the real world, and then you dial a dial that says, "Now make all the drivers drunk." Or make the drivers actively adversarial against you. And how do you make
the things more difficult for the car so that you can learn and be more reactive and understand better what could happen in the worst case scenario. Are there areas of like, you know, research or technical breakthroughs that you still feel like need to happen on the autonomous vehicle side? Or is a lot of what you're describing just mostly like, you know, there's going to be this large amount of edge cases and it's just identifying them? There is one...
a technical advance that I think could completely yet again change the landscape for autonomous driving. It would be having reliable, physically realistic world models.
being able to simulate the real world as it looks to you and me, with physical realism, with very accurate rendering of a scene.
To get there, I think there is a growing body of work around world models that could really unlock it. The simplest world model is a video prediction model, right? Things like Sora or Veo, that's kind of the proto world model in some sense. You can take a scene or an image and then push
play, and it unrolls into a possible future in a way that seems consistent with the physical world. Once you take that and you make it controllable, you make it physically realistic while keeping it rich and very plausible in both the visuals and the way things behave in the scene,
I think having essentially this digital twin of the world for autonomous driving could really change the game. There is work in that direction in research, and I think there is a potential for it to become useful to some degree to autonomous driving today.
But there is still a gap in the sense that it would be most useful for dealing with the long tail of problems, and those models are not very good at the long tail right now. What have you seen around world model building? How do folks build world models? It's interestingly come up mostly in the video context to date. It's come up in the video context first because you can build a world model that is not particularly physically realistic
but still looks really good. Yeah, and the stakes are a lot lower. Yeah. If something happens in the video world, it's just fine if your trajectory was slightly off. And that's what happened in image generation models: in the first place, people chased making things look very realistic while not trying to make them too controllable.
But now, because a lot more people are trying to start using them for authoring, for creating content, there is a lot more of a trend to trying to make them actually controllable, useful, and have very tight control.
that you can actually use, and to tune the realism, to tune the style, but also to tune the geometry, the content, and everything like this. So a lot of work is going in that direction. It was very natural that the wow videos and the magical creations would come out first. Now the big challenge is turning that into a usable tool,
a usable tool that you can use for functional purposes, not just pretty pictures. What is the limiting factor today? There is a deep question of causality at the heart of those world models, right? Right now, just by learning correlations in the data, you can generate very plausible videos because
they have kind of a plausible sequence structure. You know, objects don't disappear, they don't appear out of nowhere, people walk in the streets, and that seems really reasonable. As soon as you want to make things controllable, you need to make sure your model understands causality. Yeah.
And that, you know, this output derives from this change to the input, or from this counterfactual, and things like that. And that's very hard. Injecting causality into models is something we've always struggled with in machine learning in general.
And now we have to really solve that if we want to have a very plausible world model. How do you then on the research side at Waymo think about, obviously, having better world models would be really important for the product. There's also tons of people around the world working on it. So I could see two schools of thought, one of which is like, hey, there's lots of awesome promising research directions. Let's stay on top of those. And as they become better and better, it's something we could put in the product. The other is, we should have our own folks working on this and pushing the frontier. It's always a trade-off between...
pushing the state of the art by ourselves or leveraging what's been done outside. For many, many tasks, or many aspects of the problem, we're lucky that academia and other institutions are interested in the problem as well, and so you get a lot of leverage out of working with them. But
the AV problem has a lot of bespoke problems that people are not necessarily chasing in academia. And so we try to steer the conversation there. We
put out, for example, the Waymo Open Dataset, which is like the standard for AV research right now. And it was really designed in such a way that we helped steer the conversation and the research focus onto the problems that we thought were relevant. We also have a pretty significant research effort that is really at the
state of the art of AV right now. So we're more in a position today where we're at the forefront of AV, so we have to build the next thing. Like we can't really necessarily rely on the rest of the community to build it for us and us to sort of inherit it. One thing I've always been curious about is just like what's required when you guys enter a new city. The models are remarkably robust.
and very portable across different cities. There are always things that come up here and there, but
in general, we found that they were remarkably robust. What's an example of something that's different city to city? Well, as I mentioned, sometimes the emergency vehicles look different. And it's not necessarily that the model doesn't recognize them. It's more that we want to be sure. Yeah.
that we don't miss them, right? That those differences are well modeled by our model. So a lot of the time is spent on evaluation. It's not necessarily adapting the model, training it to do something. It's more convincing ourselves, convincing regulators, convincing the community that we've done our homework and that we validated that indeed the model is robust in the communities that we're entering.
There is a lot of just logistics to entering a new city, right? Sort of setting up depots. And that's why we have, for example, this partnership with Uber, to try and help scale up our operations and make the deployment faster. But a lot of it is really
making sure we have the hearts and minds of the local community, and that we do things in a way that is respectful and supported by the communities that we work with. Because at the end of the day, it's really about trust.
And trust is a lot more than just technology in that context. There have been a lot of questions around the rich sensor suite that Waymo uses today and the extent to which that will be necessary as models improve. I'm curious how you think about that. We use three different kinds of sensors, right? Cameras, lidars, and radars, primarily. And they're remarkably complementary.
That's one of the nice features of this kind of sensor suite. They have strengths and weaknesses that
complement each other extremely well. And because they're also kind of orthogonal to each other, we can use that diversity to check that what the camera sees correlates with what the lidar sees, and if they disagree, we want to look into that more deeply.
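As a toy illustration of that cross-checking idea (hypothetical, not Waymo's fusion logic), you could compare the depth a camera-based model estimates with the depth the lidar reports, once both are in a common frame, and flag large relative disagreements for closer inspection:

```python
import numpy as np

def cross_check(camera_depth_m: np.ndarray,
                lidar_depth_m: np.ndarray,
                rel_tolerance: float = 0.15) -> np.ndarray:
    """Flag locations where two independent depth estimates disagree.

    camera_depth_m, lidar_depth_m: depth maps in meters, same shape,
    already projected into a common frame (an assumption of this sketch).
    Returns a boolean mask of locations that deserve a closer look.
    """
    valid = (camera_depth_m > 0) & (lidar_depth_m > 0)
    rel_error = np.zeros_like(camera_depth_m)
    rel_error[valid] = np.abs(camera_depth_m[valid] - lidar_depth_m[valid]) / lidar_depth_m[valid]
    return valid & (rel_error > rel_tolerance)

# Example: a ~5% disagreement passes, a ~30% disagreement gets flagged.
cam = np.array([[10.0, 10.0]])
lid = np.array([[10.5, 14.0]])
print(cross_check(cam, lid))  # [[False  True]]
```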
There have been kind of two schools of thought on this, right? A lot of the AV companies, or companies that did driving assistance systems,
have been starting from kind of an L2 level of driving, right? And have been trying to climb their way up to L4. And the economic constraints of L2 driving are very different from L4 driving, right? When you have a fleet of cars
that basically earns money for every mile driven, you can afford to have more sensorization on the car compared to if you have a car that is owned by an individual where you have to really cut down on the price. So as a result of that kind of different business strategy,
Many companies have gone the route of: let's start simple and cheap and try to climb our way up in terms of the complexity of the system. Waymo at some point in its history decided, no, we're going to go the other way around. We're going to possibly over-sensorize and
solve the hard problem first. Sometimes solving the 10x problem is easier than solving the 1x problem, because you have the right north star. So solve that first and then see which
avenues you have to reduce the costs and simplify. But now we have the data, right? We have the data to inform those kinds of decisions, because we've solved the harder problem and we understand what matters and what doesn't.
So it'll be interesting to see over time how those technology stacks evolve. I think we're on a good trajectory with our next generation of cars to make the overall package cheaper and simpler,
and how far we can take it is going to be interesting to see. We've talked about some of the factors, how good world models get, other things. What are some of the other research areas or things we'll know in the next few years that might determine the ultimate question of four years from now, how many sensors have to be on these things? The sensor story is not just a performance problem. It's also about redundancy.
And the need for redundancy, I don't think, is going to go away in some ways. That's not to say the sensor suite will not evolve. But I think this feature of having very different sensors that give you very different information and provide this complementary signal is a pretty powerful one.
A lot of the argument around just using cameras, for example, has been: I can drive my car, I have eyes, I don't have a fancy lidar, so there is an actual proof point that people can drive with just their eyes and you don't need anything more. The problem with that is that I am
increasingly convinced that the bar for L4 driving is not human level. It's above human level, right? We've seen from our safety reports that we are today in a place where we are safer than the average person,
and we have fewer collisions and fewer reported injuries by some significant margin. That's essentially the driver being superhuman in some way. And I think that's actually a business requirement for successful driving.
Will that bar ever change? I don't think so. So the question is, can we get to that level of
better-than-human driving with a simpler sensor suite? That will be something we'll find out in the next few years. Yeah, it's incredible to see some of the superhuman displays of Waymo driving. In fact, there was this one viral video of someone on a scooter falling right in front of a Waymo car, and you're like, if that wasn't a Waymo driver, that would have ended really badly. And...
But to your point, once you've seen that, it's hard to then, as society, be like, well, we don't actually need that. Like, we could do that, but we don't necessarily need that. We take the human level. It's a lot easier to have this conversation based on data, is what I'm saying, instead of sort of litigating based on expectations. I think we now have the data, and we are going to be able to figure out exactly how performance correlates with the sensor capabilities. What are, like, the...
the milestones that matter from here, like in the autonomous vehicle world? Like what are you kind of, what do you think the next major one is? So it's funny, the milestone I have in mind right now is,
This year marks the 30th anniversary of the first transcontinental autonomous ride. Yeah, when everyone thought we were right there. Yeah. So 1995 was the first drive across the country. And I think they had more than 99% autonomy, and they drove across the U.S. at more than 60 miles per hour on average.
So you can imagine that based on that data, people were like, yeah, we're done. It's just a matter of tying a bow around it and we'll have autonomous driving. And it took 30 years to get to the point where we now have a commercial deployment. So thinking in terms of milestones and timelines, it's very hard to predict where things are at. Where we are now is:
We have the technology validation. We know that things work well in Phoenix, in San Francisco. And we have kind of the user validation: people love it. And that was not a given to me before I experienced it myself, that this was a product that people would actually just love and be attracted to, right? So
there's not really anything that stands in the way of this becoming a big product. Now it's really just the scaling that is going to be the thing. So I think the next milestones you'll see are going to be all about scale,
and sort of proving it out in various geographies. I'm excited, for example, about Tokyo, where we started driving and
collecting data, right? That's going to be our first international experiment and the first time we drive on the left side of the road. So that is going to be an interesting deployment. Bringing up the 1995
drive across the country, I actually think is a really interesting lens to transition to kind of the broader robotics space where you spent a large part of your career over at Google and DeepMind. You know, I think a lot of times there's kind of these two competing forces when I talk to people in the broader AI world about robotics where people feel like we're on the cusp of these really exciting breakthroughs and, you know, a lot has really changed in the last, you know, three, four years.
And then at the same time, everyone loves to bring up the example of the 1995 drive and be like, well, it took 30 years to actually get this stuff into a commercial product that people can use and experience. And I wonder how you think about those two competing forces and how you'd characterize where we are in that space today. Yeah, it's a really good question. Like I said before, in the robotics space, we're still chasing the nominal use case. We're still chasing:
how do we get a generalist robot to do anything we want? That's kind of the problem statement that everybody is going after. And we don't have the 1995 ride yet. We have some examples of that, but I don't think we have a convincing generalist system quite yet.
I wouldn't be surprised if we get that valid proof point in the next couple of years,
because the progress has been really rapid. But there are still some fundamental technology questions that need to be answered. We know how to generalize based on different visual inputs. We're not so good at generalizing motion. A lot of the demos you see of robots doing stuff, they do the one thing. It might be with
different color cups of coffee, if you're making coffee, or with objects that are randomized in space, but nothing that really generalizes from a skills perspective: enabling robots to do things that are very different.
You may not need that to have a commercial success. That's totally fair. You could have a robot that's completely optimized for one use case, and one use case only, but does it very well, cheaply and dexterously. And that might be enough to be able to have a business behind it. But if you're...
thinking of the general vision of the AI robot that can make your coffee and tidy up your room and pick up your clothes, before we're there, there are still some breakthroughs that are going to be needed. It feels like there's been kind of a wave of applying these LLMs to the robotics world, whether it's VLMs on the perception side, or LLMs in planning, or even code LLMs on the actuation side.
I think you said originally it was kind of a surprise to you how well these things worked. I'd love to just unpack that. What did you initially think might happen and how did it surprise you? The big surprise to me was the fact that we could so quickly go from having a chatbot
describe what it means to you to make coffee, right? To turning that into a plan that you can use on the robot, right? So what was really hard to build within the robotic environment was this common sense knowledge
of, you know, what it means to make a coffee; that if you have a cup, it goes on the table, it doesn't go on the floor; that if you look for a microwave oven, it might be in the kitchen, right? These kinds of things that you and I know, that we don't even think about, that are part of our world.
It's just basic everyday knowledge. And we didn't have access to that in robotics, or in AI at large, for a very long time. And LLMs really sort of brought that together. And the fact that we could take
that sort of high-level knowledge and quickly turn it into something that was actionable by robots. Language is fuzzy, language is not very precise, but it was just precise enough to describe things, such that we could basically build language-conditioned policies that would actually do things effectively and with very high performance. So that was the first kind of aha moment:
oh wow, actually having the language as the backbone of robotics is not completely crazy. And then it evolved into, wait a minute, robot actions are just a different language. It's not that different from...
English or Chinese. It's just a different language that is not expressed in words, it's expressed in body actions. And if you think of robotics under that lens, suddenly you can leverage all the multi-modal, large models, multilingual, and just see the robot actions as just a different dialect of expressing yourself in the world. And all the machinery there just works.
And so that was another sort of aha moment there where suddenly everything kind of started gelling together.
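To make the "actions are just another language" idea concrete, here is a minimal, hypothetical sketch of how continuous robot actions can be discretized into tokens so a standard sequence model can predict them the way it predicts words. The bin count and normalization range are illustrative assumptions, not any specific system's scheme.

```python
import numpy as np

# Discretize each continuous action dimension (e.g., end-effector deltas,
# gripper command) into one of N bins, so an action step becomes a short
# "sentence" of tokens a transformer can model like any other language.
N_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assume actions normalized to [-1, 1]

def actions_to_tokens(action: np.ndarray) -> list[int]:
    """Map a continuous action vector to discrete token ids."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    bins = np.floor((clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) * (N_BINS - 1))
    return bins.astype(int).tolist()

def tokens_to_actions(tokens: list[int]) -> np.ndarray:
    """Invert the mapping: token ids back to (approximate) continuous actions."""
    bins = np.array(tokens, dtype=float)
    return ACTION_LOW + bins / (N_BINS - 1) * (ACTION_HIGH - ACTION_LOW)

action = np.array([0.12, -0.5, 0.98, 0.0])   # e.g., dx, dy, dz, gripper
tokens = actions_to_tokens(action)
print(tokens, tokens_to_actions(tokens).round(2))
```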
Can you talk a little bit about this question of, you know, you talked about how we could have task-specific models or we could have a generalizable model. Obviously, with some of your work with RT-X and other things, you've worked on cross-embodiment and this generalized robotics problem. Do you think it's likely that we find ourselves moving toward a general robotics model over the next three, four years? Or does it feel like something where actually the most immediate
value will be provided through much more specialized, focused ones? I think it's going to be both, in the sense that you want a generalized teacher, right? You want a generalized sort of backbone model that is easily retargetable, right?
and that can be optimized for a single task. It's a little bit the same paradigm as in large language models where you have instruction tuning that enables you to develop very generic capabilities that are kind of related to the task that you might want to use your LLM for at the end of the day, but they're not necessarily the same task exactly.
But having this instruction tuned model enables you to quickly adapt your LLM to whatever the task you have in mind, either through prompting or through fine tuning or different strategies. So I think we're going to end up in the same position in robotics where you want to build a generalized robot model
And then have the right tools to target it very specifically to specific tasks, even possibly on the fly, right? If you could do that, prompting style, at test time, as opposed to at training time, then you've solved essentially all the hard problems.
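As a rough sketch of that generalist-then-specialize recipe (hypothetical, not any particular lab's pipeline), you could freeze a pretrained generalist policy backbone and fine-tune only a small action head on a handful of task-specific demonstrations:

```python
import torch
import torch.nn as nn

class GeneralistPolicy(nn.Module):
    """Stand-in for a large pretrained vision-language-action backbone."""
    def __init__(self, obs_dim=512, hidden=1024, action_dim=7):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        self.action_head = nn.Linear(hidden, action_dim)

    def forward(self, obs):
        return self.action_head(self.backbone(obs))

policy = GeneralistPolicy()            # pretend this was pretrained broadly
for p in policy.backbone.parameters(): # keep the general knowledge frozen
    p.requires_grad = False

# Fine-tune only the small head on a few task-specific demonstrations.
opt = torch.optim.Adam(policy.action_head.parameters(), lr=1e-4)
demo_obs, demo_actions = torch.randn(64, 512), torch.randn(64, 7)  # placeholder data
for _ in range(100):
    loss = nn.functional.mse_loss(policy(demo_obs), demo_actions)
    opt.zero_grad(); loss.backward(); opt.step()
```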
There are a lot of different efforts going on right now with people trying to build these kinds of powerful generalized robotic models. I'm curious how you'd broadly segment the different approaches that folks are taking, and any thoughts on the relative efficacy of some of these things. There is a lot of push right now. One way to segment the approaches is: some people have started with very hardware-centric approaches.
I want to build the best, most capable humanoid robot in the world. And then once I have these degrees of freedom and those capabilities, I will be able to accomplish all the tasks I need. Versus people who have started software first. Let's build the intelligence and...
trust that once you have a general enough intelligence model, you'll be able to retarget it to a new platform relatively easily. The work we did with RT-X kind of gave me some confidence that this path of going software first
and building a very generalized robot model that can be easily retargeted is a way to make progress fast. Because a lot of the problems in robotics still are about data. It's still about how do you acquire as much quality data as possible, as quickly as possible.
And if you think of it, putting a very expensive and wobbly robot that is very hard to operationalize in the critical path of that data acquisition is a really tall order. It may make sense if you have a lot of money to throw at the problem, but
there are real limits to the scalability of that approach. I'm also saying this because we haven't solved the problem, right? It's not about hill climbing on a problem that we have kind of solved and are trying to get better at, right?
The fundamental problem of robotic manipulation in high degrees of freedom is not solved. And so optimizing for the data collection and speed of execution is probably the most important thing to do right now. It also seems like there's kind of a debate between using purely or as much simulation versus teleoperated data. Obviously,
If you could use just simulation data, it would certainly be easier. But I think it seems like it works quite well in the locomotion context, maybe not as well in manipulation context.
Yeah, so we've really struggled with this for a long time. In the locomotion context and the navigation context, using simulation has been wonderful. The sim-to-real gap was not large enough to be problematic. In the manipulation space, we always struggled with getting the kind of
diversity of experience and quality of the contact and performance from a simulation. Because there's a cost to simulation, right? It's not necessarily a monetary cost of buying lots of robots and operationalizing them.
It's more of a cost of setting up the simulation environment, making it diverse, making it representative, tuning the physics so that the physics are realistic. And the amount of work you have to do to get that right in the context of manipulation is really, really high.
And so my experience thus far has been that it was easier or a faster path if you could scale up your physical operations to collect lots of data in the real world and not have to deal with this simulation to reality gap versus doing the simulation. That said, I also want to say that we also took that path because we could.
And because, as a research organization, you also want to take the path less trodden in some way. A lot of other research labs were a lot more invested than us in simulation, and had a vested interest in seeing simulation work, for example.
And it's also a lot more accessible for people in academia to do a lot of work in simulation. So we explored more of the: yeah, let's scale up real-world robotics and see what that part of the space can bring to the table. But I still think it's the better path for manipulation
to date. In hearing you talk, it becomes so clear that some kind of flywheel to acquire data and get data at scale is going to be crucial in the same way that you're experiencing Waymo now and you have all these edge cases that you get at the millions of miles you're driving. I know you thought a lot about the way that humans ultimately interact with these robotic interfaces, especially early on.
Any early thoughts on what are effective ways to do that and what might end up being effective ways to get that flywheel going on data acquisition in the broader robotics world? Yeah, it's a really good question. I wish our colleagues in the HRI world...
would spend more time thinking about this. The HRI for data acquisition, because I think it's a very rich space and it's really the big bottleneck for a lot of robot learning today. I made that pitch to the HRI conference a few months ago. I think there is really something there that we could do very interesting research on.
There are different strategies right now that people are using, whether it's kinesthetic teaching, sort of puppeteering, teleoperation with gloves, or trying to synthesize behaviors in simulation. It's really, I think, a very empirical question of how you maximize the throughput of data collection.
The one thing that I would love to see take off is third-party imitation, being able to learn from watching videos of people doing stuff. But that right now, I don't think anybody has really cracked.
And I think it goes back to the same question of the world model that we were talking about before. It is about inferring causality and how if I do this, then that happens from observation, being able to model that and turn that into a useful learning signal for robots to learn how to behave. So...
One thing on the data front that has been a big accelerator is the fact that we now have the big multimodal models and that transferring visual information from those multimodal models to robots actually does work. We had the example of showing a robot that is moving a Coke can to a picture of Taylor Swift
we've never taught the robot who Taylor Swift was, we've never had to act on the entity Taylor Swift or show the robot any data about Taylor Swift. That knowledge was part of the big multimodal model. So that solves one of the big bottlenecks to data acquisition. Now you really have to think about
how do you acquire the right kind of data that is the motion data? It's about the actuation and the actual sort of physical skills. And I think the jury is still out about what's the right way of doing that. Yeah. I mean, do you think to get causality into these models, it requires a new architecture? Or is like, how do you think we ultimately... I'm sure there's a bunch of research paths being tried toward it. It's a great question. I think it may just be proper data engineering. Okay.
Because thus far, at least in the language model space, we've seen that you can elicit some form of causal reasoning and chain of thought and things like that without having to engineer it into the model.
But having the right data was very important and the right inductive biases in some ways. So it's possible that we'll get there without any major infrastructure or theory changes.
But yeah, I don't know. My hope is that it's really a matter of scaling and data curation. As you think about these unanswered questions right now in robotics and the ones that we'll know the answers to in the next two, three years, what do you feel like? I mean, it sounds like this question of causality and getting that into models is certainly one of them. What are the set of other questions that you feel like will be the key determinants of where the space goes over the next few years?
Can we generalize motion? Can we generalize in the space of actions in the same way that we generalize in the space of perception? I think it's a key question that we cannot afford not to have an answer to. I think one of the big questions is going to be what are the differences
if there are differences between robotics and all of the other areas of AI. Right now, the hypothesis that robotics is just another way, another language of AI seems to hold.
At what level does it break? What is the one thing where we have to say, okay, now that's very different and we have to invent some new techniques for it? We thought, for example, that we would need to invent new techniques for motion generation, and it turns out that diffusion models, the same ones being used in video generation, work great for this use case and are right now kind of the state of the art for this kind of thing. So that
was another area where we thought we were special in a sense, and we were not. And I wonder if there are other areas like this where you need to have this specialization. And I feel like it would also be super interesting to see if we find early signs of any sort of scaling laws for some of these things. Yeah, so back
at Waymo, we've been looking a lot at scaling laws for large models for behavior and for perception. And we're seeing that the same laws apply to some degree, with different constants, right? So an
autonomous driving model doesn't behave with the same characteristics as an LLM in terms of scale and everything, but you see the same kind of linear or log-linear growth in terms of data and size and things like this. So far, everything points in the direction of
it being similar. Conceptually, it's the same. But we'll see if it hits a limit.
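As a toy example of what "the same log-linear laws with different constants" can look like (the numbers are invented, not Waymo data), you can fit a power law, loss ≈ a·N^(-b), to loss-versus-dataset-size measurements by regressing in log-log space; different domains then show up as different values of a and b:

```python
import numpy as np

# Hypothetical (dataset_size, validation_loss) measurements for one model family.
sizes = np.array([1e5, 3e5, 1e6, 3e6, 1e7])
losses = np.array([0.92, 0.71, 0.55, 0.43, 0.34])

# Fit loss = a * N^(-b)  <=>  log(loss) = log(a) - b * log(N)
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
a, b = np.exp(intercept), -slope
print(f"fitted power law: loss ~ {a:.2f} * N^(-{b:.3f})")

# Extrapolate (cautiously) to a 10x larger dataset.
print(f"predicted loss at N=1e8: {a * (1e8) ** (-b):.3f}")
```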
Before you were in the robotics world, you were in the computer vision world, right? And certainly, the advances in VLMs are obviously powering a lot of cool products and a big part of robotics. But it feels like there are all these classic computer vision use cases from back in the day, and it doesn't feel like there's been any kind of crazy inflection in the usage of some of those. We haven't had a ChatGPT moment in that space. I guess I'm curious, any reflections on that? Yeah, it's interesting, right?
Computer vision has largely been driven by the solution, not the problem, right? In the sense that there haven't been nearly as many applications of computer vision as the level of interest in the technology might have warranted.
But I think that's really because vision is only useful when you're trying to act on the world. And the fact that we didn't have physical actors meant that just using vision as a tool to observe sort of limited the scope of applications. Also, the technology...
has been working extremely well in closed-set environments. If you're trying to use computer vision at large in an open-set environment, where you're trying to just parse everything in the world,
the level of performance is good on academic benchmarks, but in the real world, it's not nearly as good, right? So there is a performance bar issue for open-set vision problems. For closed-set problems, it's really that the applications are very specific,
and I think can be unlocked really with robotics. I think that's really the best use case. You've obviously been on the cutting edge of a lot of AI research for a while. I'm curious, what's one thing you've changed your mind on in the last year? I mean, what has been...
really fascinating to me is to observe the reasoning capabilities and how they evolve and how this funny idea of chain of thought thinking that started as a funny realization in some ways and "haha, this kind of weird prompting does something, isn't that funny?"
actually sort of changed the way people think about reasoning and the path to getting a sort of performance in reasoning. I'll give you an example. I like to write...
little stories and fiction as a hobby. And I had this idea maybe 10 years ago for a topic and a premise for a science fiction story. Okay. And I sat on it for 10 years because I couldn't figure out if the physics worked.
I couldn't figure out if the premise behind the idea, the physics it was resting on were sound. And I didn't know who to ask. I didn't know where to find the answer to that question.
A couple of weeks ago I put it in Gemini and just used the Gemini deep research to ask simply the question. It gave me a three-page summary of all the equations that were relevant and the whole answer to my question was there in like five minutes.
And so to me, that's something where suddenly you have access to the best physics knowledge or the best legal knowledge in a way that is at your fingertip. And I have been asking myself, what are the millions of things I should be asking Gemini right now that I'm not even thinking I should be asking because I don't have this mental model yet that this is something that is accessible to me. Yeah.
I completely underestimated, I think, the power of having that level of accessibility and what it may mean for everyday use. And it's not just technology. It's more: are there many use cases for those things that we're not even thinking about, or haven't internalized, or haven't made a habit of? Deep Research is a super powerful product. And I guess, did the physics work in your story?
Yes. So I have to write it now. Are you going to go forward with it? There are good writing assistant tools now. It's super interesting. Obviously, I feel like this wave of test-time compute has been incredibly interesting and powerful. The biggest question I feel like right now in the space is just, you know, it obviously works really well in easily verifiable domains. How far does it get us? Superhuman performance on coding and math with, like, zero impact on other things, or...
How do you kind of think about that right now? You know, the extent to which it will be broadly applicable? Yeah, it's a good question. I think the space of problems that are, you know, hard to generate for but easy to verify is pretty broad, right? There is a lot of
things that have that shape, where coming up with a hypothesis or coming up with a plausible solution is hard, but once you have that solution, you can verify it relatively easily. Or you don't even necessarily need to verify it exactly; you can convince yourself that it's right, or that it's in the vicinity of right. And in general, we've seen this
generative-versus-discriminative split, the actor-versus-critic model in reinforcement learning terms, pop up everywhere, right? In a lot of places, you turn the hard problem
of generating a plausible answer into the other hard problem, possibly, of verifying that answer. But it's a lot easier to be in that other world of verifying an answer because you have all the ingredients there. You don't have to sort of imagine them. So I think...
It's going to evolve in such a way that it's not going to be just math, it's not going to be just coding. You could imagine in the autonomous driving case, we can verify that a plan is meeting all our requirements a lot more easily than actually generating the plan in the first place.
because we have hard constraints that we can apply to the problem. So I think there's going to be a ton of different applications that will be able to leverage this. What other areas that maybe are under-talked about today do you think these models will actually be quite effective in? With reasoning in general? Yeah.
I think anything that is multi-step and requires essentially credit attribution is at stake. I think, to me, it's RL done right. I always had this love-hate relationship with RL in the sense that in the early days of robotics, a lot of people were banking on RL being the
the ultimate solution to everything, mostly buoyed by the success of AlphaGo and everything. And we spent years and years just focusing on RL and
did not make a ton of progress as a result, because, I think as a community, we were really focused on trying to learn everything from scratch using RL when, in hindsight, there are very good ways of just using supervised learning to bootstrap yourself, and maybe you have RL as the little fine-tuning on top. So I think
that paradigm is here to stay, in a sense: bootstrap yourself with a large model, do a lot of supervised learning, and then use RL as the way to make that model even more of an expert at some specific reasoning things. That has staying power; it feels like the right way of thinking about RL. Kind of similar to the question I asked you about in robotics:
Over the next few years, I'm sure we'll flip over a lot of cards about how the broader LLM space is going to play out. What two to three questions are most top of mind for you right now that we'll learn more about in the next 12 to 24 months? I want to see where this world model thrust, which I think a lot of people are starting to seriously look into, lands. Having controllable
video generation, controllable world generation, essentially having purely generative video games, for example: whether we can make that work will teach us a lot. I think the current architectures of
large multimodal models that we have are going to be here to stay. If we cannot turn those models into good world models, then maybe there is another sort of leap that needs to be done in terms of architectures and performance. So I'm excited about that direction. I think there's a lot of important work that can be done there. Once you have models that can essentially act as
twins of anything, you can turn every computer into a generative model. It might be that this will be completely impractical from a compute perspective and require sort of massive investments. I think that's why you see a lot of massive investments going into compute right now: a lot of people are starting to see that
The things that we really want to do as the next step are going to be yet another notch up in terms of the demands in computing. Well, it's been a fascinating conversation. We always like to end our interviews with a quick fire round where we get your thoughts on some standard set of questions. And so maybe to kick it off, what's one thing that's overhyped and one thing that's underhyped in the AI world today? Overhyped. Underhyped.
I'm struggling here because I think there is a lot of superficial hype that hides really deep things. Everything I come up with as potentially overhyped, take humanoid robotics as an example: a lot of people see this as
extremely overhyped in the sense that there is a lot of investments that are going in that direction that are not justified by the current capabilities. If we manage to make humanoid robotics work in the next few years, the investment will be entirely justified. The risk is
If we don't succeed and people lose patience, we are headed for a humanoid winter. And it's going to have a negative impact, I think, on all of robotics. So essentially, they're both overhyped and underhyped in the sense that if you're working in robotics today, I think you should be working also in humanoid robotics because we can't afford not to make them work.
So yeah, there is a tension there. It's all a matter of timelines, and where you think the timelines are going to align in terms of technology development versus the amount of spend and focus of the ecosystem. Do you think LLM model progress will be more, less, or the same this year as it was last year? I think it's going to be more. How about robotics models?
more as well. The next set of questions is just an unfair set of prediction questions that are probably overly precise. But what year do you think self-driving car rides will exceed human drivers in the US? I would love for when I'm an old grandpa and to be able to talk to my grandkids and tell them, you know,
You know, in my day, we used to drive cars by hand. Can you believe this? Like, isn't that crazy? Right? Like, I feel like there is a potential future in which
we look back at today and think, man, we were crazy to leave cars in the hands of humans, given the level of accidents that this generates and the complexity of the problem. So,
That's a future I would love to see. Whether it will happen in my lifetime, I don't know. I like to think in the future you'll have to go to the rural countryside and then maybe there they'll let you get behind the wheel of a car when no one else is around. Yeah, yeah, yeah. No traffic whatsoever. What's the go-to thing that you try whenever a new model comes out to experiment with it?
When a new model comes out, I don't try to follow too closely, because there are so many models that come out all the time.
My reflexive reaction is often to go to the LMSYS leaderboard and see where it stands, and convince myself whether there's something there, whether I should be paying attention or not. So you're a metrics, not a vibes, guy. Yeah, I'm more of a metrics guy. It's very easy to fool yourself into thinking one thing or another. And I find that, you know,
when you go to a model and ask a hypothetical question, you get a very different answer than when you're trying to actually use the model for a real, functional application. So I try to really focus on: is it helping me in my life? Is it helping me in my business? That makes sense. What year do you think most Americans will have a robot in their house? So we have robots, right? We have dishwashers, we have laundry machines.
They just don't look very much like robots. Very fair. So if you're thinking about, you know, sort of like Rosie the robot... I want that coffee cup demo, like, you know, in my house. The mobile manipulator. I think it's going to take a long time. And the reason is, right, anything in your house needs to justify its square footage, right? It needs to be worth being there in the first place. And also...
If a robot is in my house today and makes a little nick on my wall,
I am very certain that this robot will be in the trash, you know, within half a second, right? The level of... A robot has to be so good and so safe for me, even as a robotics enthusiast, to accept it in my house. That's why the only robots that have really succeeded in the home environment to date are the Roombas. Because they're only sort of, you know,
hitting the part of the wall that it's okay to hit. And so I think the bar is going to be extremely high for anything that's mobile and that can access and manipulate your environment. It may be less of a lift for a fixed robot that would just be like a workstation where you just bring it
things and it does things for you, like your laundry or things in the laundry machine and things like that. But with an arm that's not long enough to nick a wall.
Yes. So I think it's going to take time. And I think we're going to see a lot more applications in logistics and industrial settings. I like the near-home space. I think there are a lot of potential applications in last-meter delivery, things like that, that could really come to fruition much sooner. Office environments, hospital environments: wherever
there are people, it's complex, but there is scale. And then it's someone's job to repaint the wall when there is a nick. That feels a lot more accessible to mobile robots. Do you have any predictions on the implications of all this AI progress on the future that you feel are under-talked about right now? Like how it might change our world or the way we go about our day-to-day lives? It's going to change education. And I don't think we have
an understanding yet of how it's going to change education. I think a lot of the narrative around education is: oh, you can use ChatGPT to cheat, so how are you going to be able to evaluate students, and things like this. And that completely ignores the bigger conversation: this is a magical tool to learn things. It's interactive. I
put my kid in front of a conversation the other day. I don't remember what the topic was, but we interactively learned about it by just having a live conversation with the agent, and it was engaging, it was fast, it was memorable. There's a ton that we can do there, and right now I feel like
I don't see that in the public sphere being discussed. Yeah. You have a pretty popular course on Udacity. Are you going to take a second swing at it with some of these new tools? That's a good question. My class is long obsolete now. It was in the early days of TensorFlow. It was a ton of fun to put together at the time. Yeah.
Do you need me now? I mean, those tools would do a much better job than I do at explaining things
and at forming the right curriculum. I don't know. I encourage our listeners to check out your writing. I feel like you are quite a clear thinker on a lot of these things, and still better than the average of the entire internet. I try to be the anti-slop. Yes. Any area of AI startups or research that we haven't talked about yet that you think is particularly exciting or interesting? I'm excited about cheese. Cheese?
I did not see that coming at all, I gotta be honest. What about cheese? It's a very French answer, I guess. Yes, yes. One startup I've been talking to recently is designing sort of plant-based cheese products
without using any milk, so building basically a casein based on plant products in a way that's cheaper and more sustainable and things like that. I think it's really cool to be able to use AI techniques to design new products
that are more like the day-to-day kind of products that can have a huge amount of impact on the world. I like this kind of... So they're using AI to explore the design space for non-animal-based cheese? Yes, yes. And I think it's exactly the kind of thing that is a little bit out of left field, but can have a massive impact, just based on the scale of animal farming and milk production and how that can affect
the world at large. I want to find the next, you know, the next platform
AI-based cheese startup. I think this is exactly... Have they made the cheese yet? Yeah. Is it good? They have a fantastic blue cheese that is indistinguishable from the cow-based thing. It's served in top restaurants in the city. Wow. I have to be honest, this is the best answer. Usually we get like, oh, I like perplexity or I like deep research or something.
AI cheese is definitely going to set the new high mark for this question. Yeah, I think AI plus something that you don't think about when you think about technology is really where I think a lot of the exciting things will happen.
and probably making those connections, and enabling people who are not necessarily in the technology world to have access to those tools, and vice versa, is really, I don't know, it's fun. Well, this has been a fascinating conversation. I'm sure folks will want to pull on all sorts of threads. And so I'd love to leave the last word to you. Where can folks go to learn more about you, about Waymo, anywhere you want to point folks? The mic is yours.
Well, when I have the inspiration, I post. I have a blog on Medium where I post random thoughts about machine learning, and that's my little water cooler with the rest of the world. Well, thanks so much. This has been an amazing conversation. Thanks.
Thank you.
Thank you for listening and see you next episode.