A World Foundation Model is a deep learning-based space-time visual simulator that predicts future scenarios by simulating environments, human intentions, and activities. It acts as a data exchange for AI, enabling the generation of virtual worlds based on text, image, video, or action prompts. These models are customizable for different physical AI setups, such as varying camera configurations, to simulate and predict outcomes in real-world environments.
A World Foundation Model differs from an LLM, which generates text descriptions, by focusing on simulating environments and generating videos. While video foundation models create videos for various use cases, World Foundation Models specifically generate simulations based on current observations and actor intentions, predicting future scenarios. They emphasize physics-based accuracy and object permanence in 3D environments.
World Foundation Models are crucial for physical AI developers because they enable simulation-based training and verification of AI systems before real-world deployment. This reduces the risk of physical harm or damage caused by AI systems interacting with the environment. They also help in policy evaluation, synthetic data generation, and predicting future actions, making physical AI deployment safer and more efficient.
The key use cases include: 1) Simulating and verifying AI policies before real-world deployment to avoid damage, 2) Predicting future scenarios to guide AI actions, and 3) Serving as a policy initialization tool, reducing the amount of training data required. These models act as a 'data exchange' for decision-making, enabling AI systems to simulate outcomes and choose optimal actions.
NVIDIA Cosmos is a developer-first platform for World Foundation Models, announced at CES. It includes pre-trained models (diffusion and autoregressive), tokenizers for video compression, post-training scripts for fine-tuning, and a video curation toolkit. The platform is open-weight, allowing developers to customize models for specific physical AI setups, such as self-driving cars or robotics, and accelerate development in the field.
The self-driving car and humanoid robot industries are expected to benefit significantly from World Foundation Models. These models enable simulation of complex environments that are difficult to replicate in the real world, ensuring AI agents behave effectively. NVIDIA is already collaborating with companies like 1X, Waabi, D'Auto, and S10 to integrate these models into their systems.
World Foundation Models are still in their infancy, with challenges in achieving robust and accurate physics-based simulations. Current models can simulate physics to some extent but lack the robustness needed for widespread application. The research community is working on establishing benchmarks to measure performance and improve the integration of these models into physical AI systems.
Autoregressive models predict tokens one at a time, making them faster due to optimizations like those in GPT. Diffusion models predict a set of tokens together, iteratively removing noise, which results in higher coherence and generation quality. Both are useful for physical AI: autoregressive models for speed and diffusion models for accuracy. NVIDIA Cosmos offers both to cater to different developer needs.
Hello, and welcome to the NVIDIA AI Podcast. I'm your host, Noah Kravitz. NVIDIA CEO Jensen Huang recently keynoted CES, the Consumer Electronics Show, in Las Vegas, Nevada. Amongst the many exciting announcements Jensen talked about was NVIDIA Cosmos. Cosmos is a development platform for world foundation models, which I think we're all going to be talking a lot about in the coming months and years. What is a world foundation model? Well,
Thankfully, we've got an expert here to tell us all about it. Mingyu Liu is Vice President of Research at NVIDIA. He's also an IEEE Fellow, and he's here to tell us all about World Foundation Models, how they work, what they mean, and why we should care about them going forward. So without further ado, Mingyu, thank you so much for joining the NVIDIA AI Podcast, and welcome. It's a pleasure to be here. So let's start with the basics, if you would. What is a World Foundation Model? Sure.
So world foundation models are deep learning-based space-time visual simulators that can help us look into the future. They can simulate environments, they can simulate people's intentions and activities. It's like the data exchange of AI. It can imagine many different environments and simulate the future, so we can make good decisions based on the simulation.
We can leverage world foundation models' imagination and simulation capability to help train physical AI agents. We can also leverage this capability to help the agent make good decisions at inference time. You can generate a virtual world based on text prompts, image prompts, video prompts, action prompts, and their combinations. We call it a world foundation model because it can generate many different worlds,
and also because it can be customized to different physical AI setups. - Right. - To become a customized world model, right? So different physical AIs have different numbers of cameras in different locations. We want the world foundation model to be customizable for different physical AI setups so it can be used in their settings. - So I want to ask you how a world model is similar or different to an LLM and other types of models.
But I think first I want to back up a step and ask you, how is a world model similar or different to a model that generates video? Because my understanding, and please correct me when I'm wrong, my understanding is that you can prompt a world model to generate a video.
But that video is generated based on the things you were talking about, based on an understanding of physics and other things in the physical world. And it's a different process. So I don't know what the best way is to kind of unpack it for the listeners. But one place to start might be, how does a world model differ from an LLM or a generative AI video model? So...
A world model is different from an LLM in the sense that an LLM is focused on generating text descriptions. It generates understanding. A world model is generating simulation, and the most common form of simulation is videos. So they are generating pixels. And so world models and video foundation models are related.
A video foundation model is a general model that generates videos. It can be for creative use cases, it can be for other use cases. In world models, we are focusing on this aspect of video generation: based on your current observation and the intentions of the actors in your world, you roll out the future. Right. Yeah. So they are related, but with a different focus. Gotcha. Thank you. So why do we need world models? I mean, I think I know part of the answer to the question. We're talking about simulating physical AI and all of these amazing things. But, you know, tell us about the need for world foundation models from your perspective.
So I think world foundation models are important to physical AI developers. You know, physical AI systems are AI deployed in the real world, and different from digital AI, these physical AI systems interact with the environment and can create damage. So this could be real harm. Right. So a physical AI system might be controlling a robotic arm or some other piece of equipment, changing the physical world. Yeah, I think there are three major use cases of world models for physical AI. Okay.
Okay. It's all around simulation. The first one is, you know, when you train a physical AI system, you train a deep learning model, and you have a thousand checkpoints. Do you know which one you want to deploy? Right? Right. And if you deploy each one individually to test it, it's going to be very time-consuming. And if it's bad, it's going to damage your kitchen. Right? So,
with a world model, you can do verification in simulation. You can quickly test out the policy in many, many different kitchens before you deploy in a real kitchen. After this verification step, you can maybe narrow down to three checkpoints and then do the real deployment. You have an easier life deploying your physical AI. It reminds me of when we've had podcasts about drug discovery,
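This verification workflow, scoring many policy checkpoints in many simulated environments and keeping a shortlist before real deployment, can be sketched in a few lines of Python. Everything here is hypothetical: `world_step` is a toy arithmetic stand-in for a learned world foundation model, and the checkpoints are simple lambdas, not real policies.

```python
# Hypothetical sketch: ranking policy checkpoints by success rate in
# simulation before real-world deployment. `world_step` stands in for a
# world foundation model's next-state prediction; it is a toy simulator.
import random

def world_step(state, action):
    # Toy stand-in for the world model's one-step prediction.
    return state + action

def rollout_success(policy, seed, horizon=10, goal=5):
    # Simulate one episode in one randomized "kitchen" and check the goal.
    rng = random.Random(seed)
    state = rng.randint(-3, 3)
    for _ in range(horizon):
        state = world_step(state, policy(state))
    return abs(state - goal) <= 1

def verify_checkpoints(checkpoints, n_envs=100):
    # Score every checkpoint across many simulated environments,
    # then keep only the top three candidates for real deployment.
    scores = {
        name: sum(rollout_success(policy, seed) for seed in range(n_envs)) / n_envs
        for name, policy in checkpoints.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:3]

checkpoints = {
    "ckpt_a": lambda s: 1 if s < 5 else 0,   # steps toward the goal
    "ckpt_b": lambda s: -1,                  # always moves away
    "ckpt_c": lambda s: (5 - s) // 2 or (1 if s < 5 else 0),
    "ckpt_d": lambda s: 0,                   # does nothing
}
best = verify_checkpoints(checkpoints)
print(best)  # the two goal-seeking checkpoints rank first
```

The same loop would work with a learned simulator swapped in for `world_step`; the point is only that thousands of simulated kitchens are cheap, while one real kitchen is not.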
and the guests talked about the ability to simulate experiments and different molecular combinations and all of that work, so that they can narrow it down to the ones that are worth trying in the actual physical lab. Right. So it sounds like, you know, similar: just being able to simulate everything and narrow it down must be such a huge advantage to developers. Yeah. And the second application is, uh,
you know, if a world model can predict the future, it has some kind of understanding of physics. You might know the action required to drive the world toward a future. The policy model, the typical one deployed in physical AI, is all about predicting the right action given the observation. A world model can be used as initialization for the policy model, and then you can train the policy model with a smaller amount of data, because the world model is already pre-trained with many different observations from the data assets. Without a world model,
what's the procedure of training a policy like? So one procedure is you collect data, and then you do supervised fine-tuning. Right. And then you may use... So it's hands-on, it's manual. You have to get all the data. It's a lot. Yeah. And the third one is, when the world model is good enough, highly accurate and fast, you know, before the robot takes any action, you just simulate different futures, check which one will really achieve your goal, and take that one.
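The third use case, simulating different futures before the robot acts and taking the action that achieves the goal, is essentially planning by rollout. A minimal sketch, with a toy `predict_future` standing in for a real world model (the states, goal, and candidate actions are all made up for illustration):

```python
# Hypothetical sketch: before acting, simulate each candidate action with
# the world model and pick the one whose predicted future best matches
# the goal. `predict_future` is a toy stand-in for a learned world model.

def predict_future(state, action, horizon=3):
    # Toy world model: the chosen action is applied at every step.
    for _ in range(horizon):
        state = state + action
    return state

def plan(state, goal, candidate_actions):
    # Roll out every candidate action and keep the best outcome.
    def score(action):
        return -abs(predict_future(state, action) - goal)
    return max(candidate_actions, key=score)

best_action = plan(state=0, goal=6, candidate_actions=[-1, 0, 1, 2, 3])
print(best_action)  # 2: three simulated steps of +2 land exactly on the goal
```

A real system would score whole action sequences against a learned model, but the shape of the loop, simulate, score, pick, is the same.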
Yeah, it's like having a data exchange next to you before you're making any decision. That would be great. You mentioned accuracy, when the models are fast enough and accurate enough. And I don't know if it's a fair question to ask, so interpret it the best way. But how do you determine, or measure, accuracy on a world model? And are there benchmarks you need to hit to deploy in different situations? How does that work?
Yeah, it's a great question. So I think world model development is still in its infancy. Right. People are still trying to figure out the right way to measure world model performance. And I think there are several aspects a world model must have. One is following the laws of physics. When you drop a ball, the model should predict it in the right position based on the physical laws, right? And also, in the 3D environment, we have to have object permanence, right? So if you turn away and come back, you know, the object should remain there. Without any other players, it should remain in the same location. So there are many different aspects I think we need to capture. I think an important part for the research community is to come up with the right benchmarks so that the community can move forward in the right direction and democratize this important area. Right.
Right. So speaking of moving forward, maybe we can talk a little bit, or you can talk a little bit about Cosmos and what was announced at CES. So
At CES, Jensen announced the Cosmos world model development platform. It's a developer-first world model platform. In this platform, there are several components. One is pre-trained world foundation models. We have two kinds of world foundation models: one is based on diffusion, the other is based on autoregressive models.
And we also have tokenizers for the world foundation models. Tokenizers compress videos into tokens so that transformers can consume them for their tasks. In addition to these two, we also provide post-training scripts to help physical AI builders fine-tune the pre-trained models to their physical AI setup.
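To illustrate what "compress videos into tokens" means, here is a toy discrete tokenizer: it averages pixel patches and snaps each to the nearest entry of a tiny codebook, yielding the flat integer sequence a transformer consumes. Real video tokenizers like those in Cosmos are learned neural codecs that also compress across time; the patch size and four-entry codebook here are invented purely for illustration.

```python
# Hypothetical sketch of discrete video tokenization: split frames into
# patches and map each patch to the index of its nearest codebook entry
# (toy vector quantization; real tokenizers are learned neural codecs).

def tokenize(video, patch=2, codebook=(0, 85, 170, 255)):
    # video: list of frames; each frame is a list of pixel rows.
    tokens = []
    for frame in video:
        for r in range(0, len(frame), patch):
            for c in range(0, len(frame[0]), patch):
                # Average the patch, then snap to the nearest code index.
                vals = [frame[r + i][c + j] for i in range(patch) for j in range(patch)]
                mean = sum(vals) / len(vals)
                tokens.append(min(range(len(codebook)), key=lambda k: abs(codebook[k] - mean)))
    return tokens  # a flat sequence of integers a transformer can consume

frame = [[0, 0, 255, 255],
         [0, 0, 255, 255],
         [90, 90, 160, 160],
         [90, 90, 160, 160]]
tokens = tokenize([frame])
print(tokens)  # [0, 3, 1, 2]
```

The compression is lossy by construction, which is exactly the trade-off discussed later: discrete tokens make autoregressive modeling possible but cost some visual accuracy.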
Some cars have eight cameras, right? And we rely on the world foundation model to predict eight views. And lastly, we also have a video curation toolkit. Processing a lot of video is a heavy computing task. There are many pieces that need to be put together, and we gathered the libraries and the ready-to-use computation code together. We want to help world model developers leverage the library to prepare data, whether they want to build their own world models or fine-tune one based on our pre-trained world foundation models. So the models provided as part of Cosmos, those are open to developers to use, open to other businesses, enterprises? Yes. So this is an open-weight development platform, meaning that the model weights are released for commercial use.
We feel this is important to physical AI builders. Physical AI builders need to solve tons of problems to build really useful robots and self-driving cars for our society. There are so many problems, and the world model is one of them. And those companies may not have the resources or expertise to build a world model. NVIDIA cares about our developers, and we know many of them are trying to make a huge impact in physical AI. So we want to help them. That's why we created this world model development platform for them to leverage, so that they can handle the other problems and we can contribute our part to the transformation of our society. Absolutely. I wanted to ask you, can you explain a little bit about the difference between diffusion models and autoregressive models, particularly in this context? Why offer both? What are the use cases and respective pros and cons? So, an autoregressive model, or AR model, is a model that predicts tokens one at a time, conditioned on what has been observed. GPT is probably the most popular autoregressive model, predicting one token at a time. Diffusion, on the other hand, is a model that predicts a set of tokens together, iteratively removing noise from the initial tokens. And the difference is that for AR models, with the significant amount of investment in GPT, there are so many optimizations, so they can run very fast. And for diffusion, because the tokens are generated together, it's easier to have coherent tokens. The generation quality tends to be better.
And both of them are useful for physical AI builders. Some of them need speed, some of them need high accuracy. So both are good. Excellent. So far, the most successful autoregressive models are based on discrete token prediction, like in GPT. So you pretty much have a set of integers as tokens, and you predict them during training. In the case of world foundation models, it means you have to organize videos into a set of integers, and you can imagine it's a challenging compression task. And because of this compression, the autoregressive model tends to struggle more on accuracy, but it has other benefits. For example, its setting is more easily integrated into the physical AI setup. Got it.
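The two sampling regimes described here can be contrasted in a toy sketch: the autoregressive loop emits one token per step conditioned on the growing prefix, while the diffusion-style loop refines all tokens jointly across a fixed number of steps. `next_token` and `denoise` are deterministic toy functions invented for this sketch, not real model calls.

```python
# Hypothetical sketch contrasting the two sampling loops: autoregressive
# generation is sequential (one token at a time, conditioned on the
# prefix), while diffusion-style generation updates all tokens together.

def ar_generate(prompt, length):
    # One token per step, each conditioned on everything before it.
    def next_token(prefix):
        return (sum(prefix) + 1) % 7  # toy deterministic "model"
    seq = list(prompt)
    for _ in range(length):
        seq.append(next_token(seq))
    return seq[len(prompt):]

def diffusion_generate(length, steps=4):
    # All tokens start as "noise" and are refined together each step.
    def denoise(tokens, step):
        return [(t + step) % 7 for t in tokens]  # toy joint update
    tokens = [3] * length  # toy initial noise
    for step in range(steps, 0, -1):
        tokens = denoise(tokens, step)
    return tokens

print(ar_generate([1, 2], length=3))  # [4, 1, 2]
print(diffusion_generate(length=3))   # [6, 6, 6]
```

The structural point survives the toy math: the AR loop's cost grows with sequence length but streams naturally, while the diffusion loop's cost grows with the number of refinement steps and produces the whole block at once.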
I'm speaking with Mingyu Liu. Mingyu is Vice President of Research at NVIDIA, and he's been telling us about world foundation models, including the announcement of NVIDIA Cosmos, the developer platform for world models that was announced during Jensen's CES keynote. So we've been talking a lot about, you've been explaining what a world model is, how it's similar and different to other types of AI models, just now the difference between autoregression and diffusion.
Let's kind of change gears a little bit and talk about the applications. How will Cosmos, how are our World Foundation models going to impact industries? Yeah, so we believe that...
First of all, the world foundation model can be used as a synthetic data generation engine to generate different synthetic data. And like I said earlier, the world model can also be used as a policy evaluation tool to determine which checkpoint, or which policy, is a better candidate for you to test out in the physical world. Right. And also, if you can predict the future, you can probably reconfigure the model to predict the action toward that future, so it's an initialization for policy training. Right, right. And also, you have a data exchange next to you before any endeavor: during test time, you simulate rollouts and pick the best decision for each moment. Are there particular industries, I know...
work in factories and industrial work, anything involving robotics, but are there specific industries that you see benefiting from world models maybe sooner than others? Yes, I think the self-driving car industry and the humanoid robot industry will benefit a lot from these world model developments. They can simulate different environments that will be difficult to have in the real world.
to make sure the agent behaves effectively. Right. So I think these are two very exciting industries the world models can impact. And NVIDIA obviously has a long history, as you were saying, of, you know, it's not just about rolling out the hardware. There's the software, the stack, the ecosystem, all of the work to support developers because if
the devs aren't building world-changing things with the products, then there's a problem, right? What are some of the partnerships, the ecosystems relative to world foundation models? And maybe there are some partners who are already doing some interesting stuff with the tech you can talk about. Yes. We are working with a couple of humanoid companies and self-driving car companies, including 1X, Waabi, D'Auto, S10, and many others. Right. So,
So, NVIDIA believes in suffering. We believe that true greatness comes from suffering. So, working with our partners, we can look at the challenges they are facing, experience their pain, and build a world model platform that is really beneficial to them. Fantastic. Yeah. So, I think this is an important part of making the field move faster. Absolutely.
All right. So you talked about being able to predict the future and you talked about just now that things moving faster. What do you see on the horizon? What's next for World Foundation models? Where do you see this going in the next, you know, five years or adjust that time frame to whatever makes sense? So I'm trying to be a world model now, try to predict the future. Exactly. Yeah. Yeah.
Yes. I believe we are still in the infancy of world foundation model development. The models can do physics to some extent, but not well or robustly enough. That's the critical point to make a huge transformation. It's useful, but we need to make it more useful. The field of AI advances very fast. From GPT-3 to ChatGPT is just... Right. Yeah, we forget it's all going so quickly. Yeah, it's going so fast. And I believe physical AI development will be very fast too, because the infrastructure for large-scale models has been established through this large language model transformation, right? And there's a strong need to have physical AI systems for self-driving cars, for humanoids. And there are also a lot of investments. So we have the great foundation, and many young researchers want to make a difference. We also have great need and investment. I think this is going to be a very exciting area, and things are going to move very fast.
I don't want to say that it will be solved in five years or ten years. I think it's still a long way. And more importantly, we also need to study how to best integrate these world models into physical AI systems in a way that can really benefit them. Right. And does that come through just working with partners out in the field, kind of combining research with application and iterating and learning? Yeah, I believe so. I believe in suffering. I believe that going hand-in-hand with our partners and understanding their problems is the best way to make progress.
For folks who would like to learn more about any aspects of what we're talking about, there are obviously resources on the NVIDIA site and, of course, the coverage of Jensen's keynote and the announcements. Are there specific places, maybe a research blog, maybe your own blog or social media channels, where people can go to learn more about NVIDIA's work with
world models and anything else you think the listeners might find interesting? Yes. So we have a white paper written for the Cosmos world foundation models.
Perfect. We welcome you to download and take a read and let me know whether it's useful to you and let me know the feedback and we will try to do better for the next one. Excellent. Mingyu, it's an absolute pleasure talking to you. I definitely learned more about world models and some of the particulars and the applications.
going forward. So I thank you for that. I'm sure the audience did as well. But, you know, the work that you're doing, as you said, it's early innings and it's all changing so fast. So we will all keep an eye on the research that you're doing and the applications, and best of luck with it. And I look forward to catching up again and seeing how quickly things evolve from here on out. Thank you. Thanks for having me. It's been fun. And I hope next time I can share more, you know, maybe a more advanced version of the world model.
Absolutely. Well, thank you again for joining the podcast. Thank you. Thank you.