This is the big race in robotics. The smarter your brain, so to speak, the less specialized your appendages have to be. AI has pushed every single one of these kind of to its limit and to a new state of the art. The way they're solving precision is, instead of throwing more sensors on the car, to basically throw more data at the problem. Data is absolutely eating the world. What is good enough? We used to have the Turing test, which obviously we've blown past now.
His shorthand for it was like the AWS of AI. He's got this idea of this distributed swarm of unutilized inference computers. Whether that's an oil rig, whether that's a mine, whether that's a battlefield, there's so many different use cases for a lot of this underlying technology that are really starting to see the light of day. It's basically an if, not a when. An inevitability.
Earlier this month, Elon Musk and the team at Tesla held their We Robot event, where they unveiled their plans for the unsupervised full self-driving CyberCab and RoboVan. Plus Optimus, their answer to consumer-grade humanoid robots, and also what Musk himself predicted would be, quote, the biggest product ever of any kind.
Now, of course, none of these products are on the market yet, but several demos were on show at the event. Naturally, the response was mixed. Supporters said we got a glimpse of the future, while critics said the details were missing. But in today's episode, we're not here to debate that. What we do want to talk about is what this indicates about the intersection of where hardware and software meet.
So what does Rich Sutton's 2019 blog post, The Bitter Lesson, tell us about the decisions that Tesla's making in autonomy? And how realistic is the quoted $30,000 price range? Also, what are the different layers of the autonomy stack? And where do we get the data to power it? And what does any of this look like when you exit the consumer sphere? We cover all this and more with A16Z partners Anjney Midha and Erin Price-Wright.
Anjney previously founded Ubiquity6, a pioneering computer vision and multiplayer technology company that sat right at this intersection of hardware and software and was eventually acquired by Discord. Erin, on the other hand, invests on our American Dynamism team with a focus on AI for the physical world. And if you'd like to dig even deeper here, Erin has penned several articles on the topic that we've linked in our show notes. All right, let's get to it.
As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. Please note that A16Z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com slash disclosures.
So last week, Tesla had their We Robot event, and Musk announced the CyberCab, the RoboVan, or as he liked to pronounce it, the robo-vahn, and the Optimus.
You guys are so immersed in this hardware-software world. I'd love to just get your initial reaction. From my perspective, it wasn't that there was anything in particular that was super surprising, but what was exciting was just sort of a culmination of one thing that Elon Musk does really well and Tesla has done really well, which is continue to
pour love and energy and money and time into a dream and a vision that's been going on for a really long time, like well past when most financial investors and most people kind of lost the luster of self-driving cars after their initial craze in the mid-to-late 2010s. And they've just continued to plod along and to continue to make developments. And now we're finally seeing this glimpse of the future for the first time in a really long time.
I think that's right. I think it was very impressive, but unsurprising. Yeah. So I think the two schools of thought when people watched the event were: one was absolutely this whole, oh my God, this is such vaporware. He shared literally nothing on engineering details. What the hell? Come on, give us the meat on timelines and dates and prices. And then the opposing view was like, holy shit.
They're still going. They haven't given up on any of this autonomy stuff that he's been talking about for years. And I'm absolutely more sympathetic towards the latter view, which is that it was sort of an homage to the bitter lesson. It's this amazing blog post that I'm going to do a terrible job of summarizing, by this great computer scientist, Rich Sutton, which basically says that over the last 70 years or so of computer science history, what we've learned is that general-purpose methods basically beat out any specific methods, in artificial intelligence in particular.
Basically, the idea is that if you're working on solving a task that requires intelligence, you're usually better off leveraging Moore's law and more compute and more data than trying to hand-engineer a technique or set of algorithms to solve a particular task. And broadly speaking, that's been the big grand debate in self-driving and autonomy, I would say, for the last two decades, right? It's the sort of general-purpose, bitter-lesson school versus the school that says let's model self-driving as a specific task,
as a set of discrete decision-making algorithms unconnected to each other. A system to solve, let's say, edge detection around stop signs, right? Because self-driving is a really hard problem, and you could totally say, well, there are so many edge cases in the world that we should map out each of those edge cases. And I think this event was an homage to the bitter lesson. So that's what I was most excited about: he did actually share details that their pipeline is basically an end-to-end deep learning approach. Which is incredible and probably true only for the last, my guess is...
18 to 24 months. Right. Yeah. Yeah. And I mean, in The Bitter Lesson, he also talks about the fact that it's really appealing to do the opposite because in the short term, you will get the benefit, but the broader deep learning approach ends up winning out in the long term. And a lot of people talk about Musk. Musk says it about himself that the timelines sometimes are off, but he's basically banking on that premise in the long term. It's basically an
if, not a when. An inevitability. Right.
The first time that this version of the future felt like an inevitability. And before we get into maybe the specifics around where else hardware and software are intersecting, I'd love to just talk about that average person who's watching, because you guys are meeting with companies and investors, and this has been going on for quite some time. So I'm just curious if maybe you noticed anything under the hood, or maybe the meta in that
announcement or event, that maybe the average person watching is missing. You know, what are they seeing? They're saying things like, oh, maybe it was human-controlled and not, like, fully AI-driven. Or other people are commenting on the fact that these humanoids are shaped like humans. Like, why do we need that?
On the topic of humanoids, I think humanoids are a great choice of embodiment for a robot to really emotionally connect to and speak to a human being watching, because I can relate to a human form factor. Obviously, we found out that it was teleoperated, which, in my opinion, still doesn't take away from how cool and amazing it was. The human form factor is a way to connect
what is happening with robotics to a regular person who is like, OK, yes, I see myself in that. This looks like Star Wars or some other sci-fi movie. In reality, and maybe this is a controversial opinion, I don't see the vast majority of economic impact over the next decade from robotics coming from the humanoid form factor.
But that doesn't take away from the power of the symbol of having a humanoid make a drink at this event because it just like connects back to the sort of science fiction promise of our childhoods getting sort of finally delivered.
The opening sequence, he started with like a sci-fi, I think it was a Blade Runner visual. And he was like, we all love sci-fi and I want to be wearing that jacket that he's wearing in the picture, but we don't want any of the other dystopian stuff. And so that definitely stuck out to me is that he did not start the way he usually does. It's often a technical first sort of story, but he started with a, here's a vision for where I think the world should go. So it was much more Disney-esque in that. And it was quite poetic. I think they literally did it on the
Warner Brothers Studio lot. And so they recreated a bunch of cities. And I think they had, on-site at the event, the robo-vans taking people around these simulated cities. There was a sort of theatricality to it all that stuck out to me, which I thought was quite different. And I thought it was refreshing, because the core problem with this branch of AI, which is largely deep learning-based and bitter-lesson-based, is that it's an empirical field, unlike, call it, Moore's law, which was pretty
predictive, where you basically know that if you double the number of transistors, you get this much more performance on the chip, and it's just about pure execution. AI is much more empirical. You don't really know when the model is going to get done training, and when it does get trained, whether it will converge or not.
Or even what does converged mean? Like, what is good enough? We used to have the Turing test, which obviously we've blown past now. It's a feeling more than it is a set of discrete metrics that you can really point to. Right. So it made a lot of sense to me that he's trying to decouple this idea of progress from a specific timeline.
I see.
It's this incremental but predictable, I think, forecasting that the tech industry keeps trying to reward. And I think what he's doing is pretty refreshing, which is saying, look, here's a vision for where we want to go, but it's decoupled. The second thing on the humanoid piece that I was quite impressed by is actually
the quality of the teleoperation. So everybody's talking about how, oh, this is fake. This is all smoke and mirrors. It's just people. Teleoperation is really hard. I was going to say, why is no one talking about that? Have you ever tried? I mean, I've tried. It's so hard. We were at a company two weeks ago and they've got these teleop robots and the founder was demoing a mechanical arm that he was teleoperating with a gamepad and he was folding clothes with it. And
I was like, oh, that looks simple. He's like, here, try it. It was one of the hardest manipulation things I've ever tried. And by the way, we tried that with, you know, a VR headset with six-DoF motion controllers, and it was almost harder to do. Teleoperating something, especially over the internet, in a smooth fashion with precision is incredibly hard. And I don't think people appreciate the degree to which they've really solved that pipeline. Yeah, I was actually really impressed by that. And, you know, I think that there's...
huge opportunity for teleop in sort of production applications that will have massive economic benefit even before we have true robots running around managing themselves. Because if you think about it, there are all these really hard and really dangerous or hard-to-get-to jobs, or there are labor differentials where it's a lot harder to hire people to do certain things in certain locations. And if we can imagine a future where the teleop that we saw last week at the event is something that's widely available,
Like, that's incredible. Imagine not having to go and service a power line, but you can actually teleop a robot to do that for you, but still have the level of sort of human training and precision needed to make a really detailed and specific evaluation. The promise of that is really cool, even before we get to robots. So that was really exciting. Yeah, it's like a stop along this journey. And so if we talk about that journey, the arc of hardware and software coming together in maybe a different way than we've seen in the past.
Just as an example, so Marc famously said software is eating the world. That was in 2011. We're in 2024. And it does feel like the last decade has been a lot of traditional software, not so much integrating with the physical world around us. And so where would you place us in that trajectory? Because we're seeing it with autonomous vehicles, but I get the sense that's not the only place where this is happening. Yeah, this is where I spend 95% of my time, in all of these industries that are just starting to see
the glimmers of what autonomy and sort of software-driven hardware can bring. What's really interesting is there's actually just a dearth of skills, of people who know how to deal with hardware and software together. You have a lot of people that went and got computer science degrees over the last decade, and relatively speaking, a lot fewer that went and got electrical engineering or mechanical engineering degrees. And we're starting to see the rise of, oh shoot, we actually need people who understand not just how the software works
in the cloud with Wi-Fi, where you have unlimited access to compute and you can retry things as many times as you want and you can ship code releases all day, every day, but how it works when you actually have a hardware deployment where you have limited compute, in an environment where you maybe can't rely on Wi-Fi all the time, where you have to tie your software timelines to your hardware production timelines.
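To make that contrast concrete, here is a minimal sketch of the kind of edge-deployment loop being described, assuming a device with bounded memory and an unreliable uplink. Every name here (read_sensor, link_is_up, try_upload) is a hypothetical stand-in, not any particular company's stack.

```python
import collections
import random
import time

LOCAL_BUFFER_LIMIT = 1_000          # bounded memory: an edge box can't grow queues forever
local_buffer = collections.deque(maxlen=LOCAL_BUFFER_LIMIT)

def read_sensor() -> dict:
    """Stand-in for a real sensor driver; returns one telemetry sample."""
    return {"t": time.time(), "value": random.random()}

def link_is_up() -> bool:
    """Stand-in for a connectivity check; real systems ping a gateway or broker."""
    return random.random() > 0.3    # pretend the link is down ~30% of the time

def try_upload(samples: list[dict]) -> bool:
    """Stand-in for a batched upload; pretend the link can also drop mid-send."""
    return len(samples) > 0 and link_is_up()

def control_step(sample: dict) -> None:
    """The part that must keep working with or without the cloud."""
    # e.g. run a small on-device model and actuate; never block on the network here
    pass

for _ in range(20):                 # a real deployment runs this loop forever
    sample = read_sensor()
    control_step(sample)            # local control never waits on connectivity
    local_buffer.append(sample)     # telemetry is buffered, not sent inline
    if link_is_up() and local_buffer:
        if try_upload(list(local_buffer)):   # clear only after a confirmed send
            local_buffer.clear()
    time.sleep(0.01)
```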
Like these are a really difficult set of challenges to solve. And right now, like there just isn't a lot of standardized tooling for developers and how to do that. So it's interesting. We're starting to see portfolio companies of ours across really different industries that are trying to use autonomy, whether it's oil and gas or water treatment or HVAC or defense companies.
They're sharing random libraries that they wrote to connect to particular sensor types, because there's not this rich ecosystem of tooling like the one that exists for the software world. So we're really excited about what we're starting to see emerge in the space. Even Elon said it when he was talking about these two different products that he's unveiling, right? Optimus, and then you have the RoboVans or CyberCabs, and
those seem like two completely different things. But he even said in the announcement: everything we've developed for our cars, the batteries, power electronics, advanced motors, gearboxes, the AI inference computer, it all applies to both. Right. So you're seeing this overlap. That's super exciting. When I was watching it, I was just nerding out, because my last company was a
computer vision, 3D mapping, and localization company. So I unfortunately spent too much of my life calibrating LiDAR sensors to our computer vision sensors. Because our whole thesis when I started back in 2017 was that you could do really precise positioning just off of computer vision, and that you didn't need fancy hardware like LiDARs or depth sensors. And to be honest, not a lot of people thought that we could pull it off. And frankly, I think there were moments when I doubted that too. And so it was just really fantastic to see that his bet,
and the company's bet, on computer vision and a bunch of these sensor fusion techniques that would not need specialized hardware would ultimately be able to solve a lot of the hard navigation problems. Which basically means that the way they're solving precision...
is, instead of throwing more sensors on the car, to basically throw more data at the problem. And so in that sense, data is absolutely eating the world. And you asked where we are on the trajectory of software eating the world. I think we're definitely on an exponential that has felt like a series of stacked sigmoids.
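As a toy illustration of the stacked-sigmoids point (mine, not the speakers'): if each successive S-curve contributes more than the last, the plateaus line up on an exponential envelope even though every individual curve flattens out. The doubling factor and spacing below are arbitrary.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def stacked(t: float, waves: int = 6, spacing: float = 5.0) -> float:
    # wave k arrives at t = k * spacing and contributes twice as much as wave k - 1
    return sum((2 ** k) * sigmoid(t - k * spacing) for k in range(waves))

# Sample a point in the middle of each plateau: each value is roughly 2x the previous,
# i.e. exponential growth overall, even though locally the curve keeps flattening out.
plateaus = [stacked(k * 5.0 + 2.5) for k in range(5)]
ratios = [b / a for a, b in zip(plateaus, plateaus[1:])]
print([round(p, 1) for p in plateaus])   # roughly [1.1, 3.2, 7.3, 15.6, 32.2]
print([round(r, 2) for r in ratios])     # ratios trend toward the per-wave factor of 2
```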
Often it feels like you're on a plateau, but a series of plateaus totally make up an exponential if you zoom out enough. And earlier in the conversation, we talked about the bitter lesson. A number of other teams in the autonomy space decided to tackle it as a hardware problem, not a software problem, right? Where they said, well, more LiDAR, more expensive LiDAR, more GPUs, more sensors, right? And Elon's like, you know, actually I want cheap cars.
that just have computer vision sensors. And what I'm going to do is take a bunch of the custom, really expensive sensors that many other companies put on the car at inference time, and just use them at train time. So Tesla does have a bunch of really custom hardware
that's not scalable, that drives around the world, in their parking lots and simulation environments, and so on. And then they distill the models they train on that custom hardware into a test-time package. And then they send that test-time package to their retail cars, which just have computer vision sensors. And the reality is that's a real arbitrage, right, between sensor stacks. And it allows the hardware out in the world to be super cheap.
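A minimal sketch of that train-time versus test-time sensor split, framed as distillation; the speakers don't spell out Tesla's actual pipeline, so the setup, shapes, and data below are illustrative placeholders: a teacher that sees camera plus LiDAR-like features during training, and a camera-only student that learns to match it and is the only thing that ships.

```python
import torch
import torch.nn as nn

CAM_DIM, LIDAR_DIM, OUT_DIM = 64, 32, 8    # toy feature sizes, not real sensor specs

# Teacher sees everything; in practice it would be pre-trained on the rich sensor suite.
teacher = nn.Sequential(nn.Linear(CAM_DIM + LIDAR_DIM, 128), nn.ReLU(), nn.Linear(128, OUT_DIM))
# Student sees only the cheap, commodity sensor stream.
student = nn.Sequential(nn.Linear(CAM_DIM, 128), nn.ReLU(), nn.Linear(128, OUT_DIM))

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(200):                    # stand-in for the real training loop
    cam = torch.randn(16, CAM_DIM)         # camera features (available on every fleet car)
    lidar = torch.randn(16, LIDAR_DIM)     # LiDAR/depth features (training vehicles only)
    with torch.no_grad():
        target = teacher(torch.cat([cam, lidar], dim=-1))
    pred = student(cam)                    # the student never sees the expensive sensors
    loss = loss_fn(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# What ships to the retail car is just `student`: camera features in, driving signal out.
```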
The result there is software is eating the sensor stack out in the world, which makes the cost of these cars so much cheaper that you can have a $30,000 fully autonomous car versus $100,000-plus cars that are fully loaded with LiDAR sensors and so on. But I think in order to have the intuition that you can even do that, you really actually have to understand hardware. If you just understand software...
And hardware is like a sort of a scary monster that lives over here. And maybe you have a special hardware team that does it. It's going to be hard for you to have the confidence to say, no, we can do it this way. I think you're totally right, which is that the superpower that Tesla has is his ability to go full stack.
Because a lot of other industries often segment out software versus hardware, like you're saying. And that means that people working on algorithms and the autonomy part just treat hardware as like an abstraction, right?
You throw over a spec, it's an API, it's an interface that I program against and I have no idea what's going on. You don't have to worry about the details. Don't have to worry about it. It doesn't matter. Which, by the way, is super powerful. It's unlocked this whole general purpose wave of models like ChatGPT and so on, right? Because it allows people who specialize in software to not have to think about the hardware. It's also what's driven sort of the software renaissance of the last 15 years. Absolutely. Decoupling, right? Composition and abstraction is sort of the fundamental basis of the entire computing revolution. But I think...
When you're like him and you're trying to bring a new device to market, kind of like what Jobs did with the iPhone, by going full stack, you end up unlocking massive efficiencies of cost. And I think that's what may have been lost in the sort of theatricality of this event: the fact that he's able to deliver an autonomous device
to retail consumers at a cost profile, thanks to vertical integration, that would just not be possible if it was just a software team buying hardware from somebody else and building on top. Can we talk about those economics, by the way, just attacking that head-on? Both Optimus and the CyberCab were quoted as being in the under-$30K range. Is that really realistic? And then, tied into what you were saying, we see other autonomous vehicles, which are betting more on the LiDAR and the sensors, which also have come down in price pretty substantially.
My guess is Elon is backing into the cost based on what people are willing to pay. And he will do whatever it takes to get those costs to line up. I mean, it's the same thing he did with SpaceX. He will operate within whatever cost constraints he needs to operate within, even if the rest of the market or the research community is telling him it's not possible. Obviously, like a 30K humanoid robot is...
way less than what most production industrial robotic arms cost today, which I think are more in the 100K range for the ones that are used in the high-end factories. So if you can get it down to 30K, that's really exciting. I also don't necessarily think you need even a 30K humanoid robot to accomplish a wide swath of the automation tasks that would pretty radically transform the way our economy functions today.
Yeah, I think Erin's right in that there's probably a top-down directive to just do whatever it takes to get into the cost footprint: this car has to cost $30K. Right. But I think if you do a bottoms-up analysis, you don't end up too far off, because actually, if you just break down the BOM, the bill of materials, on a Tesla Model 3...
You're not dramatically far off from the sensor stack you need to get to a $30,000 car, right? This is the beauty of solving your hardware problems with software: you don't need a $4,000 scanning LiDAR on the car. So I think on the CyberCab, I feel much more confident that the cost footprint is
going to fall in that range, because, frankly, we kind of have at least an ancestor of it on the streets already, right? The thing that drives prices up is custom sensors, because it's really expensive to build custom sensors in short production runs. And so you either have scale of manufacturing like an Apple, where you make a new CMOS sensor or a new Face ID sensor and you get cost economies of scale because you're shipping more like 30 million devices in your first run, or you just lean on commodity sensors from the past era and you tackle most of your problems in software. Right.
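A toy bill-of-materials comparison to show the shape of that argument. The only figure taken from the conversation is the roughly $4,000 scanning LiDAR; every other number is a made-up placeholder, not a real Tesla or competitor cost.

```python
# All numbers are illustrative placeholders except the ~$4,000 scanning LiDAR mentioned above.
vision_only = {
    "commodity cameras (x8)": 400,
    "inference computer": 1500,
}
lidar_heavy = {
    "commodity cameras (x8)": 400,
    "inference computer": 1500,
    "scanning LiDAR": 4000,
    "additional custom sensors": 6000,
}

def sensor_cost(bom: dict[str, int]) -> int:
    return sum(bom.values())

print("vision-only sensor stack: $", sensor_cost(vision_only))   # $1,900 in this toy BOM
print("LiDAR-heavy sensor stack: $", sensor_cost(lidar_heavy))   # $11,900 in this toy BOM
# The gap is the arbitrage: capability moves from per-car hardware into
# training-time compute and data, which amortize across the whole fleet.
```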
Which is what he's doing. And to that point, when he's betting on software, another interesting thing that he announced was really over-spec-ing these cars to almost change the economics potentially based on the fact that those cars could be used for distributed computing. To your point, Anj, if you put a bunch of really expensive sensors on the car, you can't really distribute the load of that in any other way than driving the car, right? Right.
If you actually have this computing layer that's, again, in his case, he's saying he's planning to over spec, that actually can fundamentally change what this asset is.
And you kind of saw the same thing even with Teslas today, where he's talking about this distributed grid, right? Where all of a sudden these large batteries are being used not just for the individual asset. So do you have any thoughts on that idea, or have we seen that elsewhere? He was a bit skimpy on details on that. But I think he did say that the AI5 chip is over-specced. It's probably going to be four to five times more powerful than the HW4, which is their current chip. It's going to draw four times more power, which probably puts it in that 800 watts or so range, which for
context, your average hairdryer is at about 1800 watts. I mean, it's hard to run power on the edge.
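Some quick back-of-envelope math on that figure: the roughly 800 watts and the roughly 1,800-watt hairdryer comparison come from the conversation, while the idle hours, electricity price, and pack size below are assumptions added just to give the watts some intuition.

```python
inference_power_w = 800        # rough AI5 draw quoted above
idle_hours_per_day = 16        # assumption: the car is parked about two-thirds of the day
price_per_kwh = 0.15           # assumption: ballpark US residential rate, in dollars

energy_kwh_per_day = inference_power_w / 1000 * idle_hours_per_day
cost_per_day = energy_kwh_per_day * price_per_kwh

print(f"{energy_kwh_per_day:.1f} kWh/day")   # 12.8 kWh/day if it ran flat out while parked
print(f"${cost_per_day:.2f}/day")            # about $1.92/day at the assumed rate
# For scale, a 75 kWh pack (also an assumed size) covers roughly 6 days of that draw,
# which is why who pays for the electricity matters to the "AWS of AI" pitch.
```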
But I think what he said was something to the effect of, your car's not working 24 hours a day. So if you're driving, call it, eight hours a day in L.A. traffic. God bless whoever's having to do that. For real. Hopefully they're using self-driving. One would hope. Actually, he opened his pitch with a story about driving to El Segundo and saying you can fall asleep and wake up on the other side. But I think the T-shirt size he gave was about 100 gigawatts of unallocated inference
compute just sitting out there in the wild. And I think his shorthand for it was like the AWS of AI, right? He's got this idea of this distributed swarm of unutilized inference computers. And it's a very sexy vision. I really want to believe in it. Ground us: is this realistic? Well, you know, I think it's realistic for workloads that we don't know yet, in the following sense, right? The magic of AWS is that it's centralized.
And it abstracts away a lot of the complexity of hardware footprints for developers. And by centralizing almost all their data centers in single locations with very cheap power and electricity and cooling, these clouds are able to pass on very cheap inference costs to the developer. Now, what he's got to figure out is how do you compensate for that in a decentralized fashion? And I think we have kind of prototypes of this today. Like, there are these decentralized clouds of people's unallocated gaming rigs, I think one is literally called Vast. People have millions of...
Nvidia 4090 gaming cards sitting on their desks that aren't used. And historically, those have not yet turned into great businesses or high utilized networks because developers are super sensitive to two things, cost and reliability. And by centralizing things, AWS is able to ensure very high uptime and reliability, whereas somebody's GPU sitting on their... Maybe available, maybe they're driving to El Segundo. Right, right. And
there are just certain things, especially with AI models, that are hard to do on highly distributed compute, where you actually need good interconnect and you need things to be reasonably close to each other. Maybe in his vision, there's a world where you have Optimus robots in every home, and somehow your home Optimus robot can take advantage of additional compute or additional inference with your Tesla car that's sitting outside in your driveway. Who knows? Right. Okay. Well, this event clearly was focused on
different models that are consumer-facing. So again, CyberCab, that's for someone using an autonomous vehicle. Optimus is a humanoid robot, probably in your home. But Erin, you've actually been looking at the hardware-software intersection in a bunch of other spaces, right? And as you alluded to earlier, maybe different applications with better economics, at least today.
I think long term, there's no market bigger than the consumer market. So everyone having a robot in their home and a Tesla car in their driveway that's also a robo-taxi has huge economic value. But that's also a really long-term vision. And there's just so much happening in autonomy that's taking advantage of the momentum and the developments that companies like Tesla have put forward into the world over the last decade, and that actually has the potential to have meaningful
impact on our economy in the short term. I think the biggest broad categories for me are largely the sort of dirty and unsexy industries that have very high cost of human labor, often because of safety or location.
access, whether that's an oil rig out in the middle of Oklahoma somewhere that's three hours' drive from New York City, whether that's a mine somewhere in rural Wyoming that freezes over for six months out of the year so humans can't live there and mine, whether that's a battlefield where we're starting to see autonomous vehicles go out and clear bombs and mines from battlefields to protect human life.
There's so many different use cases for a lot of this underlying technology that are really starting to see the light of day. So very excited about that. And as we think about that opportunity, you've also talked about this software-driven autonomy stack. So as you think about the stack, what are the layers? Can you just break that down? Yeah, sure. So whether it's a self-driving car or sort of an autonomous control system, we're seeing the stack break down into pretty
similar categories. So first is perception. You have to see the world around you, know what's going on, be able to see if there's a trash can, be able to understand if there's a horizon, if you're a boat.
The second is something Anj knows really well, which is localization and mapping. So, okay, what do I see? How do I find out where I am within that world, based on what I can see and what other sensors I can detect, whether it's GPS, which often isn't available in battlefields or in warehouses, et cetera. The third is planning and coordination. So that's, okay, how do I take a large task and turn it into a series of smaller tasks? So what is more of an
instant reaction. I don't have to really think about how to take a drink of water, but I might have to think about how to make a glass of lemonade from scratch. So how do I think about compute across those different types of regimes, when something is more of an instinct versus when something has to be sort of broken down and processed into discrete operations?
And then the last one is control. So that's like, how does my brain talk to my hand? Like, how do I know what are the nerve endings doing in order to pick up this water bottle and take a drink out of it? And that's a really interesting kind of field that's existed for decades and decades. But for the first time, probably since the 70s, we're starting to see really interesting stuff happen in the space of controls around autonomy and robotics. And I would say like all of these are pre-existing things.
None of this is wildly new, but I think in the last two years, especially with everything that's happening with deep learning, video language models, broadly speaking, AI has pushed every single one of these kind of to its limit and to a new state of the art.
And there just aren't tools that exist to tie all that together. So every single robotics company, every single autonomous vehicle company is basically like rebuilding this entire stack from scratch, which we see as investors as a really interesting opportunity as the ecosystem evolves.
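A skeletal version of the four layers Erin walks through, perception, localization and mapping, planning, and control, wired into one loop. The class and method names are hypothetical; the point is simply that every robotics team ends up writing some version of these interfaces itself today.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    image: bytes                       # raw camera frame (placeholder)
    imu: tuple                         # accelerometer/gyro reading (placeholder)

class Perception:
    def detect(self, obs: Observation) -> list[str]:
        return ["trash_can"]           # what is around me?

class Localization:
    def locate(self, obs: Observation, detections: list[str]) -> tuple[float, float]:
        return (0.0, 0.0)              # where am I, with or without GPS?

class Planner:
    def plan(self, pose: tuple[float, float], goal: str) -> list[str]:
        # break one big task into smaller steps ("make lemonade" -> discrete actions)
        return ["approach", "grasp", "pour"]

class Controller:
    def execute(self, step: str) -> None:
        pass                           # brain-to-hand: joint and motor commands go here

def autonomy_tick(obs: Observation, goal: str) -> None:
    """One pass through the stack; real systems run the layers at different rates."""
    perception, localization, planner, controller = Perception(), Localization(), Planner(), Controller()
    detections = perception.detect(obs)
    pose = localization.locate(obs, detections)
    for step in planner.plan(pose, goal):
        controller.execute(step)

autonomy_tick(Observation(image=b"", imu=(0.0, 0.0, 0.0)), goal="make lemonade")
```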
And as you think about that ecosystem, people kind of say that as soon as you touch hardware, you're literally working on hard mode compared to just a software-based business. So what are the unique challenges, even with maybe that AI wave today that's pushing things ahead? How would you break down what becomes so much harder?
I think Anj touched on this a little bit before, but the more you can commoditize the hardware stack, the better. So the most successful hardware companies are the ones that aren't necessarily inventing a brand new sensor, but are just taking stuff off the shelf and putting it together. But still, tying everything together is really hard. When you think about releasing a phone, for example, Apple has a pretty fast shipping cadence and they're still releasing a new phone only once a year. So you...
have to essentially tie a lot of your software timelines to hardware timelines in a way that doesn't exist when you can just sort of ship whenever you want into clouds.
If you need a new sensor type or you need a different kind of compute construct or you need something fundamentally different in the hardware, you're bound by those timelines. You're bound by your manufacturer's availability. You're bound by how long it takes to quality engineer and test a product. You're bound by supply chains. You're bound by figuring out how these things have to integrate together. So the cycles...
are often just quite a lot slower. And then the other thing is when you're interacting with the physical world, you get into use cases that touch safety in a really different way than we think about with pure software alone. And so you have to design things for a level of like hardiness and reliability
that you don't always have to think about with software by itself. If your ChatGPT is a little slow, it's fine. You can just try again. But if you have an autonomous vehicle that's driving a tank on a battlefield autonomously and something doesn't work, you're kind of screwed. So you have to have a much higher level of rigor and testing and safety built into your products, which slows down the software cycles. The holy grail is sort of general-purpose
intelligence for robotics, which we still don't have.
When you train a general model, you basically get the ability to build hardware systems that don't have to be particularly customized. And that reduces hardware iteration cycles dramatically because you can basically say, look, roughly speaking, these are the four or five commodity sensors you need. The smarter your brain, so to speak, the less specialized your appendages have to be. And I think what a number of really talented teams are trying to solve today is, can you get models to generalize across embodiments, right? Can you train a model that can work
seamlessly on a humanoid form factor or a mechanical arm, a quadruped, whatever it might be. And I'm quite bullish that it will happen. I think the primary challenge there that teams are struggling with today is the lack of really high quality data, right? The big unknown is just how much data, both in quantity and quality, do you really need to get models to be able to reason about the physical world spatially in a way that abstracts across any hardware?
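One minimal sketch of what "abstracting across hardware" can look like in practice, under the assumption that each embodiment's observations get discretized into a shared token vocabulary so a single model can train on all of them. The binning scheme, embodiment IDs, and field names are invented for illustration.

```python
import numpy as np

NUM_BINS = 256          # shared discrete vocabulary for all continuous sensor values
LOW, HIGH = -1.0, 1.0   # assume each sensor stream is pre-normalized into this range

def tokenize(values: np.ndarray) -> np.ndarray:
    """Map normalized floats to integer tokens in [0, NUM_BINS)."""
    clipped = np.clip(values, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1)).astype(np.int64)

def to_token_stream(embodiment: str, sensors: dict[str, np.ndarray]) -> np.ndarray:
    """Flatten any robot's observation dict into one token sequence.

    A per-embodiment prefix token (outside the value vocabulary) tells the model
    which body produced the data, so one network can train on arms, quadrupeds,
    humanoids, and so on.
    """
    prefix = NUM_BINS + {"arm": 0, "quadruped": 1, "humanoid": 2}[embodiment]
    chunks = [np.array([prefix], dtype=np.int64)]
    for name in sorted(sensors):                     # fixed field ordering per embodiment
        chunks.append(tokenize(sensors[name].ravel()))
    return np.concatenate(chunks)

arm_obs = {"joint_angles": np.random.uniform(-1, 1, 7), "gripper": np.array([0.3])}
quad_obs = {"joint_angles": np.random.uniform(-1, 1, 12), "imu": np.random.uniform(-1, 1, 6)}

print(to_token_stream("arm", arm_obs).shape)         # (9,)  same vocabulary,
print(to_token_stream("quadruped", quad_obs).shape)  # (19,) different sequence lengths
```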
I'm completely convinced that once we unlock that, the applications are absolutely enormous, right? Because it frees up hardware teams, like Erin was saying, from having to couple their software cycles to hardware cycles. It decouples those two things. And I think that's the holy grail. I think the victory of the autonomy team over at Tesla is having realized, eight years ago, the efficacy of what we call early fusion foundation models, right? Which is the idea that you take a bunch of sensors at training time and different inputs of vision, depth, you take in video, audio, you take a bunch of different six-DoF sensors, and you tokenize all of those and you fuse them all at the point of training. And you build an internal representation of the world for that model. In contrast, the LLM world does what's called late fusion. You often start with a language model that's trained just on language data, and then you duct-tape on these other modalities, right? Like image and video and so on. And I think the world is now starting to realize that early fusion is the way forward. But of course, they have an eight-year head start. And so I get really excited when I see teams either tackling the sort of data challenge for general
spatial reasoning or teams that are taking these early fusion foundation model approaches to reasoning that then allow the most talented hardware teams to focus really on what they know best. Where are these companies getting training data from?
Because you mentioned Tesla, for example. Yes, we've had cars on the road, tons of them, with these cameras and sensors. I still think that one of the smartest things Elon did was turn on full self-driving for everybody for like a month-long trial period last summer. I have a Tesla, and I turned it on for my free month, and it was like
a life-changing experience, and I obviously couldn't get rid of it. And so now, not only do I pay for full self-driving, but I also... You're feeding the pipeline. I'm giving him all my data. So to me, that's really clever. And so I'm curious, as you talk about some of these other applications, do they have the number of devices, or in this case cars for Tesla, capturing this data? Or how else are we going to get this spatial data? Yeah.
This is the big race in robotics right now. I think there are several different approaches. Some people are trying to use video data for training. Some people are investing a lot in simulation and creating digital 3D worlds. And then there's a mad rush for every kind of generated data that you could possibly have. So whether that's robotic teleoperated data, whether that's robotic arms in offices, most of these robotics companies have
pretty big outposts where they're collecting data internally. They're giving humanoids to their friends to have in their homes. It's a yes-and scenario right now, where everyone is just trying to get their hands on data literally however they can. I think it's the Wild West. But if you're Tesla, then the secret weapon you have is you've got your own factories, right? So the Optimus team has a bunch of robots
walking around the factories, constantly learning about the factory environment. And that gives them this incredible self-fulfilling sort of compounding loop. And then, of course, he's got the Tesla fleet like
Erin was saying earlier, with FSD. I'm proud to have been a month-one subscriber for it. And I'm happy that I'm contributing to that training cycle, because it makes my Model X smarter next time around. So the challenge, then, is for companies that don't have their own full-stack, sort of fully integrated environment, right? Where they don't have deployments out in the field. And to Erin's point, you can either take the simulation route for that and say, we're going to create these sort of synthetic
pipelines. Or we're seeing this huge build-out of teleop fleets. Like with language models, you had people all around the world in countries showing up and labeling data. You have teleop fleets of people piloting mechanical arms halfway around the world.
I think there's an interesting sort of third new category of efforts we're tracking, which is crowdsourced coalitions, right? So an example of this is the DeepMind team put out this robotics dataset called RT-X, maybe a year and a half ago, where they partnered with a bunch of academic labs and said, hey, you send us your data. We've got compute and researchers. We'll train the model on your data and then send it back to you. And what's happening is there's just different labs around the world who have different
robots of different kinds. Some are arms, some are quadrupeds, some are bipeds. And so instead of needing all of those to be centralized in one place, there's a decoupling happening where some people are saying, well, we'll specialize in providing the compute and the research talent, and then you guys bring the data. And then it's a give to get model, right? Which we saw in some cases with the internet early on.
NVIDIA is an example of this, where their research team isn't stacking a bunch of robots in-house. So they're instead partnering with people like pharma labs who have arms doing pipetting and wet-lab experiments and saying, you send us the data. We've got a bunch of GPUs. We've got some talented deep learning folks. We'll train the model and send it back to you. And I think it's an interesting experiment. And there's reason to believe this sort of give-to-get model might end up actually having the highest diversity of data. But we're definitely in sort of full experimentation land right now. Yeah. And my guess is we'll need all of it.
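As a sketch of what a give-to-get coalition might standardize on, here is a minimal pooled episode record in the spirit of the RT-X effort; the field names and structure are hypothetical, not the actual RT-X schema.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    observation_tokens: list[int]      # tokenized sensors (camera, depth, proprioception, ...)
    action: list[float]                # commanded action in that embodiment's action space
    reward: float = 0.0                # optional; many datasets are reward-free demonstrations

@dataclass
class Episode:
    lab: str                           # who contributed it (the "give" side)
    embodiment: str                    # "arm", "quadruped", "biped", ...
    task: str                          # natural-language task description
    steps: list[Step] = field(default_factory=list)

pool: list[Episode] = [
    Episode(lab="lab_a", embodiment="arm", task="fold towel",
            steps=[Step(observation_tokens=[12, 87, 3], action=[0.1, -0.2, 0.0])]),
    Episode(lab="lab_b", embodiment="quadruped", task="traverse gravel",
            steps=[Step(observation_tokens=[5, 200, 41, 9], action=[0.3, 0.3, 0.1, 0.1])]),
]

# The "get" side: whoever has the compute trains one model on the pooled episodes
# and sends checkpoints back to every contributing lab.
print({ep.lab: ep.embodiment for ep in pool})
```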
So it sounds like data is a big gap and it sounds like some builders are working on that. But where would you guys like to see more builders focused in this like hardware software arena, especially because I do think there are some consumer facing areas where people are drawn to they see an event like this and they're like, oh, I want to work on that.
Yeah, I'm pretty excited about the long tail of really unsexy industries that have outsized impact on our GDP and are often really critical industries where people haven't really been building for a while. Things like energy, manufacturing, supply chain, defense. These industries that really carry the U.S. economy, where we have underinvested from a technology perspective probably in the last several decades, are poised to be pretty transformed.
by this sort of hardware software melding in autonomy. I'd love to see more people there. I'm very excited for all the applications Erin talked about. And I think to unlock those, we really need a way to solve this data bottleneck, right? So startups, builders who are figuring out really novel ways to collect that data in the world, get it to researchers, make sense of it, curate it. I think that's sort of a fundamental limiter on progress across all of these industries. We just need to sort of 10x the rate of experimentation in that space.
All right, that is all for today. If you did make it this far, first of all, thank you. We put a lot of thought into each of these episodes, whether it's guests, the calendar Tetris, the cycles with our amazing editor, Tommy, until the music is just right. So if you like what we put together, consider dropping us a line at ratethispodcast.com slash A16Z. And let us know what your favorite episode is. It'll make my day, and I'm sure Tommy's too. We'll catch you on the flip side.