
How World Foundation Models Will Advance Physical AI With NVIDIA’s Ming-Yu Liu - Episode 240

2025/1/7

The AI Podcast

People
Mingyu Liu
Topics
Mingyu Liu: I am Vice President of Research at NVIDIA and an IEEE Fellow. World foundation models are deep-learning-based space-time visual simulators that can simulate a variety of environments and behaviors, helping us predict the future and make better decisions. They can be used to train physical AI agents, to improve an agent's decision-making during inference, and to generate virtual worlds from text, image, video, and action prompts. World foundation models can be customized to different physical AI setups, adapting to different camera counts and placements. They differ from large language models: LLMs focus on text description and understanding, while world models focus on simulation, usually in the form of video.

World foundation models matter to physical AI developers because physical AI systems interact with the environment and can cause real damage. World models enable simulation, which reduces that risk. They can be used to validate different checkpoints of a physical AI system, cutting the risk and time cost of deploying in real environments. They can predict the future and serve as initialization for policy models, reducing the amount of data needed for policy training. And they can simulate before a robot acts, selecting the best course of action.

Methods for evaluating world model accuracy are still evolving and must cover several aspects, including the laws of physics and object permanence. NVIDIA's Cosmos platform provides pre-trained world foundation models (diffusion-based and autoregressive), tokenizers, and fine-tuning scripts to help developers customize the models. We released Cosmos to help physical AI developers, who may lack the resources and expertise to build world models themselves. Cosmos offers both diffusion-based and autoregressive world foundation models: the former generates higher quality, the latter runs faster, serving different needs. Autoregressive models may trail diffusion models in accuracy, but they are easier to integrate with physical AI systems.

World foundation models can serve as synthetic data generation engines, policy evaluation tools, and policy initialization, and can provide reference data during decision-making. The self-driving car and humanoid robot industries will benefit most from world models. We collaborate with many companies on world foundation models, improving the platform by understanding our partners' needs. World foundation models are still early-stage, and progress will be fast because the infrastructure for large models is in place and there is enormous demand and investment. Future work must study how best to integrate world models into physical AI systems. See the NVIDIA website and white paper for more information.


Chapters
This chapter introduces World Foundation Models (WFMs) as deep learning-based space-time visual simulators capable of predicting future events and simulating various environments. It explains how WFMs differ from LLMs and video generation models, focusing on their simulation capabilities for training and decision-making in physical AI.
  • WFMs are deep learning-based space-time visual simulators.
  • WFMs simulate physical environments, enabling better AI decision-making.
  • WFMs differ from LLMs by generating simulations (videos/pixels) instead of text descriptions.

Transcript


Hello, and welcome to the NVIDIA AI Podcast. I'm your host, Noah Kravitz. NVIDIA CEO Jensen Huang recently keynoted the CES Consumer Electronics Show conference in Las Vegas, Nevada. Amongst the many exciting announcements Jensen talked about was NVIDIA Cosmos. Cosmos is a development platform for world foundation models, which I think we're all going to be talking a lot about in the coming months and years. What is a world foundation model? Well,

Thankfully, we've got an expert here to tell us all about it. Mingyu Liu is Vice President of Research at NVIDIA. He's also an IEEE Fellow, and he's here to tell us all about World Foundation Models, how they work, what they mean, and why we should care about them going forward. So without further ado, Mingyu, thank you so much for joining the NVIDIA AI Podcast, and welcome. It's a pleasure to be here. So let's start with the basics, if you would. What is a World Foundation Model? Sure.

So world foundation models are deep learning-based space-time visual simulators that can help us look into the future. They can simulate environments, and they can simulate people's intentions and activities. It's like the imagination engine of AI: it can imagine many different environments and simulate the future, so we can make good decisions based on this simulation.

We can leverage world foundation models' imagination and simulation capability to help train physical AI agents. We can also leverage this capability to help the agent make good decisions during inference time. You can generate a virtual world based on text prompts, image prompts, video prompts, action prompts, and their combinations. We call it a world foundation model because it can generate many different worlds,

and also because it can be customized to different physical AI setups. - Right. - To become a customized world model, right? Different physical AI systems have different numbers of cameras in different locations, so we want the world foundation model to be customizable to different physical AI setups so they can use it in their settings. - So I want to ask you how a world model is similar or different to an LLM and other types of models.

But I think first I want to back up a step and ask you, how is a world model similar or different to a model that generates video? Because my understanding, and please correct me when I'm wrong, my understanding is that you can prompt a world model to generate a video.

But that video is generated based on the things you were talking about, based on understanding of physics and other things in the physical world. And it's a different process. So I don't know what the best way is to kind of unpack it for the listeners. But one place to start might be, how does a world model differentiate from an LLM or a generative AI video model? So...

A world model is different from an LLM in the sense that an LLM is focused on generating text descriptions; it generates understanding. A world model generates simulation, and the most common form of simulation is video, so it generates pixels. And so world models and video foundation models are related.

A video foundation model is a general model that generates videos. It can be for creative use cases, it can be for other use cases. In world models, we are focusing on one aspect of video generation: based on your current observation and the intentions of the actors in your world, you roll out the future. Right. Yeah. So they are related, but with a different focus. Gotcha. Thank you. So why do we need world models? I mean, I think I know part of the answer to the question. We're talking about simulating physical AI and all of these amazing things. But tell us about the need for world foundation models from your perspective.

So I think world foundation models are important to physical AI developers. You know, physical AI systems are AI deployed in the real world,

and unlike digital AI, these physical AI systems interact with the environment and can create damage. So this could be real harm. Right. So a physical AI system might be controlling a robotic arm or some other piece of equipment, changing the physical world. Yeah, I think there are three major use cases for physical AI. Okay.

Okay. It's all around simulation. The first one is, you know, when you train a physical AI system, you train a deep learning model and you have a thousand checkpoints. Do you know which one you want to deploy? Right? Right. And if you deploy each one individually, it's going to be very time consuming. And if a checkpoint is bad, it's going to damage your kitchen. Right? So,

With a world model, you can do verification in simulation. You can quickly test out this policy in many, many different kitchens before

you deploy in the real kitchen. After this verification step, you maybe narrow down to three checkpoints, and then you do the real deployment. You have an easier life deploying your physical AI. It reminds me of when we've had podcasts about drug discovery
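To make that checkpoint-screening idea concrete, here is a toy Python sketch. Everything in it is hypothetical: the "world-model rollout" is faked with a scoring function, and the checkpoint names and skill numbers are invented for illustration.

```python
import random

random.seed(0)  # deterministic toy example

def simulated_rollout(checkpoint_skill, kitchen_difficulty):
    """Toy stand-in for a world-model rollout: returns a success score.
    A real rollout would simulate the policy acting in a generated world."""
    return checkpoint_skill - kitchen_difficulty + random.gauss(0, 0.05)

# A thousand hypothetical training checkpoints, each with an unknown skill level.
checkpoints = {f"ckpt_{i:03d}": random.random() for i in range(1000)}
# Fifty simulated kitchens of varying difficulty.
kitchens = [random.uniform(0.2, 0.8) for _ in range(50)]

# Score every checkpoint across all simulated kitchens, keep the top three.
scores = {
    name: sum(simulated_rollout(skill, k) for k in kitchens) / len(kitchens)
    for name, skill in checkpoints.items()
}
finalists = sorted(scores, key=scores.get, reverse=True)[:3]
print(finalists)  # the three checkpoints worth deploying in a real kitchen
```

The point is the shape of the workflow, not the scoring math: a thousand candidates go into cheap simulation, and only a handful survive to expensive real-world deployment.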

and the guests talking about the ability to simulate experiments and different molecular combinations, all of that work, so that they can narrow it down to the ones that are worth trying in the actual physical lab. Right? So it sounds, you know, similar: just being able to simulate everything and narrow it down must be such a huge advantage to developers. Yeah. And the second application is,

you know, if a world model can predict the future, it has some kind of understanding of physics. You might know the action required

to drive the world toward a desired future. The policy model, the typical one deployed in physical AI, is all about predicting the right action given the observation. A world model can be used as initialization for the policy model, and then you can train the policy model with less data, because the world model is already pre-trained with many different observations from the dataset. Without a world model,

what's the procedure of training a policy like? So one procedure is you collect data,

and then you start to do supervised fine-tuning. Right. And then you may use... So it's hands-on, it's manual. You have to get all the data. It's a lot. Yeah. And the third one is: when the world model is good enough, highly accurate and fast, before the robot takes any action, you just simulate different futures. Right. And check which one really achieves your goal, and take that one.
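As a rough sketch of that third use case, assuming a toy one-dimensional world where a hand-written dynamics function stands in for a real world model, "simulate different futures and take the best one" looks like this:

```python
def world_model_predict(state, action):
    """Hypothetical world model: predicts the next state after an action.
    Here the 'world' is a single number and the dynamics are just addition."""
    return state + action

def goal_score(state, goal=10):
    """How close a simulated future lands to the goal (higher is better)."""
    return -abs(goal - state)

state = 4
candidate_actions = [-2, 1, 3, 6, 9]

# Before acting, simulate each candidate future and pick the best one.
best_action = max(
    candidate_actions,
    key=lambda a: goal_score(world_model_predict(state, a)),
)
print(best_action)  # 6, since 4 + 6 lands exactly on the goal of 10
```

A real robot would do the same loop with video rollouts instead of integers, which is why the speed and accuracy of the world model matter so much here.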

Yeah, it's like having a simulator next to you before you make any decision. You mentioned accuracy, when the models are fast enough and accurate enough. And I don't know if it's a fair question to ask, so interpret it the best way. But how do you determine, or measure, accuracy on a world model? And are there benchmarks, you know, different benchmarks you need to hit to deploy in different situations? Or how does that work?

Yeah, it's a great question. So I think world model development is still in its infancy. Right. So people are still trying to figure out the right way to measure world model performance. And I think there are several aspects a world model must have. One is following the laws of physics: when you drop a ball, the model should predict it in the right position, based

on the laws of physics, right? And also, in the 3D environment, we have to have object permanence, right? So if you

turn away and come back, you know, the object should remain there, right? Without any other actors, it should remain in the same location. So there are many different aspects I think we need to capture. I think an important task for the research community is to come up with the right benchmarks so that the community can move forward in the right direction and advance this important area. Right.

Right. So speaking of moving forward, maybe we can talk a little bit, or you can talk a little bit about Cosmos and what was announced at CES. So

At CES, Jensen announced the Cosmos world model development platform. It's a developer-first world model platform. So in this platform, there are several components. One is pre-trained world foundation models. We have two kinds of world foundation models: one is based on diffusion, the other is based on autoregression.

And we also have tokenizers for the world foundation models. Tokenizers compress videos into tokens so that transformers can consume them for their tasks. In addition to these two, we also provide post-training scripts to help physical AI builders fine-tune the pre-trained models to their physical AI setup.
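To illustrate what "compressing videos into tokens" means at the simplest possible level, here is a toy sketch. A real Cosmos tokenizer uses a learned neural encoder and codebook; this hypothetical one just quantizes the mean intensity of each space-time patch:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny "video": 8 frames of 16x16 grayscale pixels with values in [0, 1).
video = rng.random((8, 16, 16), dtype=np.float32)

def tokenize(video, patch=(2, 4, 4), levels=256):
    """Toy tokenizer: split the video into space-time patches and map each
    patch to one integer by quantizing its mean intensity. Real tokenizers
    learn this mapping instead of averaging."""
    t, h, w = patch
    T, H, W = video.shape
    patches = video.reshape(T // t, t, H // h, h, W // w, w)
    means = patches.mean(axis=(1, 3, 5))              # one scalar per patch
    return (means * levels).astype(np.int64).ravel()  # discrete tokens

tokens = tokenize(video)
print(tokens.shape)  # (64,): 2048 pixels compressed to 64 integer tokens
```

Even this crude version shows the trade-off discussed later in the episode: the harder you compress, the shorter the token sequence a transformer must model, but the more visual detail is lost.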

Some cars have eight cameras, right? And we rely on our world foundation model to predict eight views. And lastly, we also have a video curation toolkit. Processing a lot of video is a demanding computing task; there are many pieces that need to be processed. So we gathered the libraries and ready-to-use computation code together, because we want to help

world model developers leverage the library to curate data, whether they want to build their own world models or fine-tune one based on our pre-trained world foundation models. So the models provided as part of Cosmos, those are open to developers to use? Are they open to other businesses, enterprises? Yes. So this is an open-weight development platform, meaning that the model weights are released for commercial use.

We feel this is important to physical AI builders. Physical AI builders need to solve tons of problems to build really useful robots and self-driving cars

for our society. There are so many problems, and the world model is one of them. And those companies may not have the resources or expertise to build a world model. NVIDIA cares about our developers, and we know many of them are trying to make a

huge impact in physical AI. So we want to help them. That's why we created this world model development platform for them to leverage, so that they can handle the other problems and we can contribute our part to the transformation of our society. Absolutely. I wanted to ask you, can you explain a little bit about the difference between diffusion models and autoregressive models, particularly in this context? Why offer both? What are the use cases and respective

pros and cons? So, an autoregressive model, or AR model, is a model that predicts one token at a time, conditioned on what has been observed. GPT is probably the most popular autoregressive model predicting one token at a time. Diffusion, on the other hand, is a model that predicts a set of tokens together

and iteratively removes noise from these initial tokens. And the difference is that for AR models, with the significant amount of investment in GPT, there are so many optimizations that they can run very fast. And for diffusion, because the tokens are generated together, it's easier to have coherent tokens; the generation quality tends to be better.

And both of them are useful for physical AI builders: some of them need speed, some of them need high accuracy. So both are good. Excellent. So far, the most successful autoregressive models are based on discrete token prediction, like in GPT.

So you pretty much represent data as a set of integers, tokens, and you predict them during training. And in the case of world foundation models, it means you have to organize videos into a set of integers, and you can imagine it's a challenging compression task. And because of this compression, the autoregressive model tends to struggle more on accuracy, but it has other benefits. For example,

its setting is more easily integrated into the physical AI setup. Got it.
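The two generation styles can be caricatured in a few lines of Python: the autoregressive loop emits one token at a time from left to right, while the diffusion loop starts from noise over all tokens and refines them jointly. Both "models" here are hypothetical stand-ins, not real networks:

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([3.0, 1.0, 4.0, 1.0, 5.0])  # stand-in for "good" token values

def ar_generate(n):
    """Autoregressive style: emit one token at a time, left to right.
    A real model would sample p(x_i | x_<i) from a transformer."""
    out = []
    for i in range(n):
        out.append(target[i])  # would be conditioned on the prefix out[:i]
    return np.array(out)

def diffusion_generate(n, steps=12):
    """Diffusion style: start from pure noise over ALL n tokens and
    iteratively denoise them together toward the target."""
    x = rng.normal(size=n)
    for _ in range(steps):
        x = x + 0.5 * (target[:n] - x)  # each step refines every token jointly
    return x

print(ar_generate(5))                      # the sequence, built token by token
print(np.round(diffusion_generate(5), 2))  # close to the target after 12 steps
```

The structural contrast is the point: the AR loop's cost grows with sequence length but streams output as it goes, while the diffusion loop pays a fixed number of full-sequence refinement passes, which is why the episode pairs AR with speed-sensitive integration and diffusion with quality.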

I'm speaking with Mingyu Liu. Mingyu is Vice President of Research at NVIDIA, and he's been telling us about world foundation models, including the announcement of NVIDIA Cosmos, the developer platform for world models that was announced during Jensen's CES keynote. So we've been talking a lot about, you've been explaining what a world model is, how it's similar and different to other types of AI models, just now the difference between autoregression and diffusion.

Let's kind of change gears a little bit and talk about the applications. How will Cosmos, how are our World Foundation models going to impact industries? Yeah, so we believe that...

First of all, the world foundation model can be used as a synthetic data generation engine to generate different kinds of synthetic data. And like what I said earlier, the world model can also be used as a policy evaluation tool to determine which checkpoint, or which policy, is

a better candidate for you to test out in the physical world. Right. And also, if you can predict the future, you probably can reconfigure the model to predict the action toward that future, so it's a policy training initialization. Right, right. And also, you have a simulator next to you before any endeavor: during test time, simulate rollouts and pick the best decision for each moment. Are there particular industries, I know robots

work in factories and industrial work, anything involving robotics, but are there specific industries that you see benefiting from world models maybe sooner than others? Yes, I think the self-driving car industry and the humanoid robot industry will benefit a lot from these world model developments. They can simulate different environments that would be difficult to have in the real world

to make sure the agent behaves effectively. Right. So I think these are two very exciting industries the world models can impact. And NVIDIA obviously has a long history, as you were saying, of, you know, it's not just about rolling out the hardware. There's the software, the stack, the ecosystem, all of the work to support developers because if

the devs aren't building world-changing things with the products, then there's a problem, right? What are some of the partnerships, the ecosystems relative to world foundation models? And maybe there are some partners who are already doing some interesting stuff with the tech you can talk about. Yes. We are working with a couple of humanoid companies and self-driving car companies, including 1X, Waabi, and many others. Right. So,

So, NVIDIA believes in suffering. We believe that true greatness comes from suffering. So, working with our partners, we can look at the challenges they are facing, experience their pain, and that helps us build a world model platform that is really beneficial to them. Fantastic. Yeah. So I think this is the important part to make the field move faster. Absolutely.

All right. So you talked about being able to predict the future, and you talked just now about things moving faster. What do you see on the horizon? What's next for world foundation models? Where do you see this going in the next, you know, five years, or adjust that time frame to whatever makes sense? So I'm trying to be a world model now, trying to predict the future. Exactly. Yeah. Yeah.

Yes. I believe we are still in the infancy of world foundation model development. The model can do physics to some extent, but not well or robustly enough. That's the critical point to make a huge transformation: it's useful, but we need to make it more useful. So the field of AI advances very fast, and from GPT-3 to ChatGPT was just...

Right. Yeah, we forget, it's all going so quickly. Yeah, it's going so fast. And I believe physical AI development will be very fast too, because the infrastructure for large-scale models has been established through this large language model transformation, right?

And there's a strong need to have physical AI systems for self-driving cars, for humanoids. And there are also a lot of investments. So we have the great foundation, and many young researchers want to make a difference. We have great need and investment. I think this is going to be a very exciting area, and things are going to move very fast.

I don't want to say that it will be solved in five years or ten years. So I think it's still a long way. And more importantly, we also need to study

how to best integrate these world models into physical AI systems in a way that can really benefit them. Right. And does that come through just working with partners out in the field, kind of combining research with application and iterating and learning? Yeah, I believe so. I believe in suffering. So I believe that working hand-in-hand with our partners and understanding their problems is the best way to make progress.

For folks who would like to learn more about any aspects of what we're talking about, there are obviously resources on the NVIDIA site and, of course, the coverage of Jensen's keynote and the announcements. Are there specific places, maybe a research blog, maybe your own blog or social media channels, where people can go to learn more about NVIDIA's work with

world models and anything else you think the listeners might find interesting? Yes. So we have a white paper written for the Cosmos world model development platform.

Perfect. We welcome you to download it and take a read, and let me know whether it's useful to you. Let me know the feedback, and we will try to do better with the next one. Excellent. Mingyu, it's an absolute pleasure talking to you. I definitely learned more about world models and some of the particulars and the applications

going forward. So I thank you for that. I'm sure the audience did as well. But, you know, the work that you're doing, as you said, it's early innings and it's all changing so fast. So we will all keep an eye on the research that you're doing and the applications, and best of luck with it. And I look forward to catching up again and seeing how quickly things evolve from here on out. Thank you. Thanks for having me. It's been fun. And I hope next time I can share more, you know, maybe a more advanced version of the world model.

Absolutely. Well, thank you again for joining the podcast. Thank you. Thank you.