World models are essential because they allow AI systems to understand and predict the three-dimensional world, enabling capabilities like reasoning, planning, and common sense that are currently beyond large language models (LLMs).
LLMs are limited to one-dimensional predictions (text) and lack a deep understanding of the physical world. They struggle with tasks requiring common sense, causal reasoning, and practical application of knowledge, unlike humans who learn these skills quickly through interaction with the environment.
World models are three-dimensional representations of the world that allow AI to predict outcomes of actions and understand cause-and-effect relationships. LLMs, on the other hand, are trained on text data and lack intrinsic understanding of the physical world, relying solely on linguistic patterns.
Both World Labs and Google's Genie 2 are pioneering the development of world models, which are seen as a critical step toward achieving AGI. These models promise to unlock significantly smarter AI systems by enabling them to perceive and interact with the physical world more effectively.
Building world models is computationally intensive and requires solving complex problems related to perception, reasoning, and planning. Additionally, integrating these models into practical AI systems remains a significant technical and engineering challenge.
Sensors allow AI systems to perceive the environment, while embodiment enables interaction with the physical world, which is crucial for learning cause-and-effect relationships. Without these, AI systems are limited to passive observation and cannot fully develop a robust world model.
The human brain learns world models through sensory-motor learning, where it predicts and observes outcomes of actions. This process is fundamental to developing common sense and understanding the physical world, which AI systems currently lack.
Experts like Yann LeCun believe AGI is still years to decades away due to the limitations of current AI systems, which lack a deep understanding of the world. World models are seen as a potential solution but are still in the early stages of development.
LLMs acquire knowledge from vast datasets but struggle to update their knowledge easily. They rely on retraining for new information, unlike humans who can assimilate new facts quickly with minimal exposure.
The neocortex is a prediction machine that learns world models through sensory-motor learning. It predicts outcomes of actions and updates its model based on discrepancies between predictions and actual sensory responses, which is key to developing common sense.
Today on the AI Daily Brief, we are discussing world models and what they mean for AGI. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes.
Hello, friends. Well, we have had a couple interesting examples of world models showing up in the news this week. We had Fei-Fei Li's World Labs previewing some of the things that they had built. And then we also got Google DeepMind releasing their paper about Genie 2. I mentioned earlier that world models are interesting not just because of their cool capabilities, but because of what they might mean in terms of our progress towards artificial general intelligence.
Today, we're going to read two pieces that speak to that. The first is an article from TechCrunch about Meta's AI lead, Yann LeCun, and what he says about how world models could be the key to advances in AI. I'm going to turn it over to the 11 Labs version of me to read this piece, and then I will be back. Meta's AI chief says world models are key to human-level AI, but it might be 10 years out.
Are today's AI models truly remembering, thinking, planning, and reasoning, just like a human brain would? Some AI labs would have you believe they are, but according to Meta's chief AI scientist, Yann LeCun, the answer is no.
He thinks we could get there in a decade or so, however, by pursuing a new method called a world model. Earlier this year, OpenAI released a new feature it calls Memory that allows ChatGPT to remember your conversations. The startup's latest generation of models, O1, displays the word thinking while generating an output, and OpenAI says the same models are capable of complex reasoning.
That all sounds like we're pretty close to Artificial General Intelligence, AGI. However, during a recent talk at the Hudson Forum, LeCun undercut AI optimists such as xAI founder Elon Musk and Google DeepMind co-founder Shane Legg, who suggest human-level AI is just around the corner. "We need machines that understand the world, machines that can remember things, that have intuition, have common sense, things that can reason and plan to the same level as humans," said LeCun during the talk.
Despite what you might have heard from some of the most enthusiastic people, current AI systems are not capable of any of this. LeCun says today's large language models, LLMs, like those that power ChatGPT and Meta AI, are far from human-level AI. Humanity could be years to decades away from achieving such a thing, he later said. That doesn't stop his boss, Mark Zuckerberg, from asking him when AGI will happen, though.
The reason why is straightforward. Those LLMs work by predicting the next token, usually a few letters or a short word, and today's image-video models are predicting the next pixel. In other words, language models are one-dimensional predictors, and AI image-video models are two-dimensional predictors. These models have become quite good at predicting in their respective dimensions, but they don't really understand the three-dimensional world. Because of this, modern AI systems cannot do simple tasks that most humans can.
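To make the one-dimensional predictor point concrete, here is a minimal, hypothetical sketch of autoregressive next-token prediction built from a toy bigram table. The corpus, tokens, and probabilities are invented purely for illustration and are nothing like a real LLM's scale, but the shape of the computation, predicting one token at a time from what came before, is the same.

```python
# Minimal sketch (not from the article): an autoregressive "one-dimensional"
# predictor that only ever models the next token given previous tokens.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()  # toy data, invented

# Count bigram co-occurrences: how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Greedily pick the most likely next token; no world model, just statistics."""
    followers = bigrams.get(word)
    return followers.most_common(1)[0][0] if followers else "<eos>"

# Generate a short continuation token by token, exactly one dimension at a time.
token = "the"
sequence = [token]
for _ in range(5):
    token = predict_next(token)
    sequence.append(token)

print(" ".join(sequence))  # e.g. "the cat sat on the cat"
```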
LeCun notes how humans learn to clear a dinner table by the age of 10 and drive a car by 17, and learn both in a matter of hours. But even the world's most advanced AI systems today, built on thousands or millions of hours of data, can't reliably operate in the physical world. In order to achieve more complex tasks, LeCun suggests we need to build three-dimensional models that can perceive the world around us, centered on a new type of AI architecture: world models.
A world model is your mental model of how the world behaves, he explained. You can imagine a sequence of actions you might take, and your world model will allow you to predict what the effect of that sequence of actions will be on the world.
Consider the world model in your own head. For example, imagine looking at a messy bedroom and wanting to make it clean. You can imagine how picking up all the clothes and putting them away would do the trick. You don't need to try multiple methods or learn how to clean a room first. Your brain observes the three-dimensional space and creates an action plan to achieve your goal on the first try. That action plan is the secret sauce that AI world models promise.
Part of the benefit here is that world models can take in significantly more data than LLMs. That also makes them computationally intensive, which is why cloud providers are racing to partner with AI companies. World models are the big idea that several AI labs are now chasing, and the term is quickly becoming the next buzzword to attract venture funding.
A group of highly regarded AI researchers, including Fei-Fei Li and Justin Johnson, just raised $230 million for their startup, World Labs. The godmother of AI and her team are also convinced world models will unlock significantly smarter AI systems. OpenAI also describes its unreleased Sora video generator as a world model but hasn't gotten into specifics.
In a 2022 paper on objective-driven AI, LeCun outlined an idea for using world models to create human-level AI, though he notes the concept is over 60 years old. In short, a base representation of the world, such as video of a dirty room, and memory are fed into a world model.
Then the world model predicts what the world will look like based on that information. Then you give the world model objectives, including an altered state of the world you'd like to achieve, such as a clean room, and guardrails to ensure the model doesn't harm humans to achieve an objective. Don't kill me in the process of cleaning my room, please. Then the world model finds an action sequence to achieve these objectives.
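As a rough illustration of that loop, here is a hypothetical Python sketch, not LeCun's actual architecture: a stubbed world model imagines the next state for each candidate action sequence, and the planner picks the sequence that best satisfies the objective (a clean room) while respecting a guardrail cost. All state variables, actions, and costs are invented stand-ins.

```python
# Hypothetical sketch of objective-driven planning with a world model.
# Nothing here is LeCun's real system; the state, actions, and costs are toy stand-ins.
from itertools import product

ACTIONS = ["pick_up_clothes", "make_bed", "vacuum", "do_nothing"]

def world_model(state: dict, action: str) -> dict:
    """Toy predictor: returns the imagined next state after taking `action`."""
    nxt = dict(state)
    if action == "pick_up_clothes":
        nxt["clothes_on_floor"] = 0
    elif action == "make_bed":
        nxt["bed_made"] = True
    elif action == "vacuum":
        nxt["floor_dirty"] = False
        nxt["noise"] += 1  # side effect the guardrail may care about
    return nxt

def objective_cost(state: dict) -> float:
    """How far the imagined state is from the goal: a clean room."""
    return state["clothes_on_floor"] + (0 if state["bed_made"] else 1) + (1 if state["floor_dirty"] else 0)

def guardrail_cost(state: dict) -> float:
    """Penalize undesirable side effects (stand-in for 'don't harm humans')."""
    return 10.0 if state["noise"] > 2 else 0.0

start = {"clothes_on_floor": 3, "bed_made": False, "floor_dirty": True, "noise": 0}

best_plan, best_cost = None, float("inf")
for plan in product(ACTIONS, repeat=3):          # brute-force search over short plans
    state = start
    for action in plan:                          # roll the plan forward in imagination
        state = world_model(state, action)
    cost = objective_cost(state) + guardrail_cost(state)
    if cost < best_cost:
        best_plan, best_cost = plan, cost

print(best_plan, best_cost)
```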
Meta's long-term AI research lab, FAIR, Fundamental AI Research, is actively working toward building objective-driven AI and world models, according to LeCun. FAIR used to work on AI for Meta's upcoming products, but LeCun says the lab has shifted in recent years to focusing purely on long-term AI research. LeCun says FAIR doesn't even use LLMs these days. World models are an intriguing idea, but LeCun says we haven't made much progress on bringing these systems to reality.
There's a lot of very hard problems to get from where we are today. And he says it's certainly more complicated than we think. It's going to take years before we can get everything here to work, if not a decade, said LeCun. Mark Zuckerberg keeps asking me how long it's going to take. All right, back to Real Me. Now you have a little bit of context. And with that, we're going to flip it over to Lawrence Knight, who wrote a post on Medium called Toward AGI, World Models and Why We Need Them. This is a longer, more comprehensive piece. And so once again, I'm turning it over to Eleven Labs to read.
Today's episode is brought to you by Vanta. Whether you're starting or scaling your company's security program, demonstrating top-notch security practices and establishing trust is more important than ever.
Vanta automates compliance for ISO 27001, SOC 2, GDPR, and leading AI frameworks like ISO 42001 and NIST AI risk management framework, saving you time and money while helping you build customer trust. Plus, you can streamline security reviews by automating questionnaires and demonstrating your security posture with a customer-facing trust center all powered by Vanta AI. Over 8,000 global companies like Langchain, Leela AI, and Factory AI use Vanta to demonstrate AI trust and prove security in real time.
Learn more at vanta.com slash nlw. That's vanta.com slash nlw. Today's episode is brought to you, as always, by Superintelligent.
Have you ever wanted an AI daily brief but totally focused on how AI relates to your company? Is your company struggling with AI adoption, either because you're getting stalled figuring out what use cases will drive value or because the AI transformation that is happening is siloed at individual teams, departments, and employees and not able to change the company as a whole? Superintelligent has developed a new custom internal podcast product that inspires your teams by sharing the best AI use cases from inside and outside your company.
Think of it as an AI daily brief, but just for your company's AI use cases. If you'd like to learn more, go to besuper.ai slash partner and fill out the information request form. I am really excited about this product, so I will personally get right back to you. Again, that's besuper.ai slash partner. Toward AGI, world models and why we need them, AI mind. Current AI systems seem to be lacking common sense. Could world models be the answer?
Introduction. The purpose of this article is to share some thoughts on where we are on the journey toward Artificial General Intelligence, AGI. There is currently a lot of excitement around AI, driven by the stunning success of large language models, LLMs, and their ability to capture the imagination of the media and public.
This has fueled speculation that we may be on the edge of an AGI revolution, with all the risks and opportunities that brings with it. I am a cognitive science functionalist believing there are no fundamental barriers to machines achieving human-like intelligence. However, I am not convinced the technology we currently have at our disposal puts us on an inexorable path toward AGI. I will argue that for artificial systems to achieve something like human intelligence, they will need to learn world models.
If this is true, and there are many good reasons to believe it is, then we are not as far along the path toward AGI as some would like us to believe. The intention is for this article to serve as an introduction to a series of articles that will explore further the subject of AGI.
What we'll cover. This article provides brief commentary on the following questions. What is intelligence? Where are we on the journey to AGI? Are large language models AGI? What are world models? What is needed to learn a world model? An appropriate learning architecture. Sensors. Embodiment. Before we start talking about machine intelligence, it is worth considering what we mean by intelligence. The Oxford English Dictionary defines intelligence as the ability to acquire and apply knowledge and skills. This is a very broad definition.
To allow us to be more precise in what we mean when we refer to human intelligence, we can break it down into subcategories such as spatial intelligence, bodily kinesthetic intelligence, musical intelligence, linguistic intelligence, logical mathematical skills, interpersonal intelligence, intrapersonal intelligence, naturalistic intelligence.
So perhaps a good starting definition of human intelligence is the ability to acquire and apply knowledge and skills across the range of above-referenced intelligence categories. AGI tends to be defined in relation to human intelligence. For example, an AI system that is at least as capable as a human at most tasks. When using this type of definition, it is important to be clear whether we are talking about all tasks, both cognitive and physical, or as is more typical, just cognitive tasks.
As we progress through this article and start to dig into the idea of world models and why they will likely be important in the development of AGI, my argument is that world models are important to systems trying to replicate any and all types of human intelligence. Where are we on the journey to AGI? The recent success of large language models, LLMs, at generating human-like text as well as demonstrating some limited capacity for reasoning has sparked a lot of interest in AI.
Some in the field are even suggesting the very latest LLMs are showing the first signs of AGI.
The team at DeepMind recently proposed a framework for classifying the capabilities and behavior of artificial general intelligence, AGI, models and their precursors in their paper Levels of AGI: Operationalizing Progress on the Path to AGI. We see from Table 1 that whilst to date we have been very successful at building specialist narrow AI systems, we have been almost entirely unsuccessful at creating generally intelligent AI systems.
The authors of the paper suggest that ChatGPT, Bard, and Llama 2 meet the criteria for an emerging AGI. I do not agree with this. Having worked with these systems, I do not consider them equal to or somewhat better than an unskilled human at a wide range of cognitive tasks, as they show little understanding of how the world works and so have very limited practical reasoning skills. Are large language models AGI? Language models are able to produce text that is written in a very human-like way.
This can lead us to attribute a level of intelligence to these systems that is perhaps unfounded. If we consider the broad definition of intelligence, it is about the ability of a system to acquire and apply knowledge. So let's consider each of these characteristics. Language models have been able to acquire a huge amount of knowledge from the internet-scale datasets used to train them. This knowledge comes at a significant cost, however. Once trained, language models do not acquire new information easily, and typically require full bottom-up retraining to acquire even small amounts of new information.
This is very different to human learning, where new facts can be assimilated easily, often after being exposed to the new information only once. In regard to the application of knowledge, language models are excellent at retrieving relevant knowledge and presenting it in well-written text. Where they do less well is in the practical application of what we might call common sense. As Yann LeCun put it, LLMs are trained on text data that would take 20,000 years for a human to read, and still they haven't learned that if A is the same as B, then B is the same as A (LeCun, 2022). There are techniques such as few-shot prompting that help LLMs to do better at reasoning, but this is nonetheless an inherent weakness. There is a strong argument to say that even with scaling, language models won't become better at reasoning about the world, as they have no intrinsic understanding of the world (LeCun, 2022). It is thought that language evolved relatively late in humans as an addition to an already generally intelligent brain.
The areas of the human brain most associated with language and speech are Wernicke's area and Broca's area. These areas make up a very small part of the human brain, providing support to the notion that there is a lot going on in a generally intelligent brain other than language. Given this, there is no reason to suppose that language models, having a very narrow focus on the acquisition of language, should demonstrate a more general intelligence.
I would argue that what humans have that language models do not is a deep understanding of how the world works. And it is on top of this deep understanding that our ability for language is layered, allowing us to describe this fundamental understanding of the world. Language models, on the other hand, encode the structure of language using probability distributions of the co-occurrence of words in written text. This allows language models to describe various aspects of our world, but there is no deep understanding.
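To ground the co-occurrence claim, here is a small hypothetical sketch, revisiting the bigram idea from the earlier example from a different angle: the kind of statistic a language model ultimately distills is a conditional distribution over which word follows which, estimated from raw text. The corpus and resulting numbers are invented for illustration.

```python
# Hypothetical sketch: the statistical core of a language model is (a far more
# sophisticated version of) conditional word co-occurrence probabilities like these.
from collections import Counter, defaultdict

text = "water is wet fire is hot ice is cold water is cold".split()  # toy corpus

counts = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    counts[prev][nxt] += 1

# Normalize counts into P(next word | previous word).
cooccurrence = {
    prev: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
    for prev, followers in counts.items()
}

print(cooccurrence["is"])  # {'wet': 0.25, 'hot': 0.25, 'cold': 0.5}
# The model "knows" that 'cold' often follows 'is', but nothing about temperature itself.
```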
This lack of understanding arises because they are not grounded in real-world experience, leading them to struggle with many basic aspects of causal reasoning and physical common sense. What are world models? If large language models and other AI systems are not on a path toward AGI, the question becomes what is it specifically they are missing?
The consensus among thinkers such as Yann LeCun and Jeff Hawkins is that they are missing a model of the world. It is widely accepted in the cognitive neuroscience community that birds and mammals with a relatively advanced neocortex learn world models. This stands to reason with a little knowledge about the function of the neocortex. The neocortex is essentially a prediction machine that is constantly making predictions in every sensory modality about what it should expect to sense.
For the neocortex to be able to make predictions about the world, it must first learn what is normal. Brains learn a world model by interacting with their environment, a process known as sensorimotor learning. Essentially, the brain plans an action, predicts how the environment will change, and then observes the sensory responses and compares these to the prediction. Where the sensory responses are as predicted, the world model is confirmed. Where the brain's predictions are not confirmed, our attention is drawn to the area of misprediction, and the world model is updated.
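As a loose caricature of that loop, here is a hypothetical sketch: the agent acts, predicts the sensory consequence, compares the prediction with what is actually sensed, and nudges its model wherever the prediction was wrong. Every function, number, and action name below is an invented stand-in, not a model of any real brain mechanism.

```python
# Hypothetical sketch of sensorimotor learning: act, predict, observe, update.
import random

# The agent's internal world model: how much it thinks each action moves it (invented numbers).
world_model = {"step_forward": 0.5, "step_back": -0.5}

# The real environment behaves a bit differently from the model, plus sensor noise.
TRUE_EFFECT = {"step_forward": 1.0, "step_back": -1.0}

LEARNING_RATE = 0.2

for step in range(50):
    action = random.choice(list(world_model))

    predicted_change = world_model[action]                         # what the model expects to sense
    observed_change = TRUE_EFFECT[action] + random.gauss(0, 0.05)  # what the sensors actually report

    prediction_error = observed_change - predicted_change
    # A large error is a "surprise": attention is drawn there and the model is updated.
    world_model[action] += LEARNING_RATE * prediction_error

print(world_model)  # each entry drifts toward the true effect of that action
```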
One criticism that can be leveled at artificial systems such as LLMs is that they seem to lack common sense. The notion of common sense in humans can be thought of as a manifestation of having a robust world model. Common sense tells us what is likely, what is plausible, and what is impossible in the world we inhabit. LLMs only really understand what is likely and what is plausible from a linguistic point of view.
So specifically, what types of common sense knowledge do humans learn that so far AI systems have demonstrated little grasp of? Each of the following facts about our physical world is learned early in life by human infants, and each is fundamental to our world model. We learn that the world is three-dimensional. We learn that every source of light, sound, and touch in the world has a distance from us. We learn the notion of objects, and the fact that nearer objects can occlude more distant ones. Objects can be assigned to broad categories as a function of their appearance or behavior. Objects do not spontaneously appear, disappear, change shape, or teleport; they move smoothly and can only be in one place at any one time. We learn notions of intuitive physics such as stability, gravity, and inertia, and that the effects of animate objects on the world, including the effects of our own actions, can be used to deduce cause-and-effect relationships. What is needed to learn a world model?
If world models are fundamental to who we are, and to the development of generally intelligent systems, we should ask what is required to learn them. We'll start by trying to answer this question at a high level and then dig into some of the detail. At a high level, an AGI system would need an appropriate architecture and learning algorithm, sensors through which the world can be perceived, a body through which the system can interact with the world, appropriate drives and motivations to actively explore the world and learn a world model.
The human brain has evolved over millions of years a structure optimal to learning a model of the world. Replicating a structure with similar functionality is likely the single greatest challenge to developing an AGI. To learn a world model, it is likely that a system of modules working together will be required. A good place to start would be a system of modules that in some way mirror the function of the human brain.
Yann LeCun, a distinguished thinker and researcher in the field of computer science and a founding father of convolutional neural nets, CNNs, proposes just such a system in his 2022 position paper A Path Towards Autonomous Machine Intelligence. In this paper, he proposes both an architecture and training paradigm for developing intelligent machines that can learn, reason, and plan like humans and animals. The paper is very detailed and runs to some 70 pages in length.
Some of the key takeaways include the following. An AGI could be made up of a set of modules that mirror the functionality of specialist regions of the brain. The modules should be trainable through gradient-based learning. Perceptual inputs will need to be converted to representations that abstract away all superfluous detail.
LeCun submits that the ability to represent sequences of world states at several levels of abstraction is essential to intelligent behavior. Basic behavioral motivations and drives will need to be hard-coded into the system, the equivalent of human drives to reduce hunger, fear, and pain. A short-term memory will be required to keep track of past, current, and predicted world states. Actions can either be driven by pre-learned automated responses to particular states or through a more involved process of reasoning and planning.
Multi-level representations of world states and actions can be used to decompose a complex task into successively more detailed subtasks. Another specific proposal about how machines might learn world models is given by Jeff Hawkins in his book, A Thousand Brains. The theory suggests that the neocortex learns many complete models of objects and concepts, and that these models work together to create your perception of the world. Key takeaways from the theory are as follows. The neocortex is homogeneous in structure.
Specialist regions for seeing, touching, language, and higher-level thought all have the same structure. The fundamental unit of the neocortex, the unit of intelligence, is the cortical column. Cortical columns implement a fundamental algorithm that is responsible for every aspect of perception and intelligence. Cortical columns in the neocortex learn a world model through sensory motor learning.
Predictions occur inside neurons. A prediction occurs when a neuron recognizes a pattern, creates a dendrite spike, and is primed to spike earlier than other neurons. The secret of the cortical column is reference frames.
Reference frames are key to understanding intelligence. The brain arranges all knowledge using reference frames, and thinking is a form of moving through positions in these reference frames. The difference in function between what and where cortical columns depends on what their reference frames are anchored to. Reference frames don't have to be anchored to something physical. A reference frame for a concept can exist independently of everyday physical things.
Jeff Hawkins leads a research team that is trying to implement a learning algorithm similar to that implemented by the cortical column. In his book, he states they have had success in this implementation, though they don't appear to have published details in any of their academic papers. The thousand brains theory is a compelling theory and seems to be a promising area for further research. All the theory I have read agrees that to learn a world model, a system will need to be able to sense the environment that it is operating in.
The sensors bestowed upon a system will fundamentally dictate the world model that is learned by that system. If we take the human eye as an example, the eye allows us to sense a very narrow band of the electromagnetic spectrum that we call visible light. We are able to sense how visible light interacts with the environment, but we are completely oblivious to parts of the spectrum such as x-rays, infrared rays, and radar.
This shapes our particular experience of our environment. It would be easy to imagine a system that has sensors calibrated for a different range of the electromagnetic spectrum that would experience the environment in a different, though no less meaningful, way. The notion that sensors are required to learn a world model raises questions around whether an AGI needs to learn its world model through interaction with the real world, or whether a world model can be learned in a virtual environment.
My thinking on this subject is not well developed, but my sense is that for a world model to be useful in the real world, the model would have to be learned, at least in part, through interaction with the real world. There is evidence for this in the development of autonomous vehicles, where systems can be partially developed in virtual environments, but the virtual environment cannot completely replace experience in the real world. Embodiment. Animals learn world models by interacting with the physical world (Barlow, 1989). This is known as sensory motor learning.
A key part of sensory motor learning is making predictions about how the environment will respond when the agent interacts with it; as such, a body will be required to allow this environmental interaction. Without a body, the agent is reduced to passive observation, which would make it significantly more difficult to learn cause-and-effect relationships. If this is true, the fields of AI research and robotics will need to become more closely integrated over time. Conclusion. We have covered a lot in a short space of time and have no doubt raised more questions than answers.
What I hope to convey in this article is that for us to achieve AGI, it is likely that we will need to develop systems that are able to learn world models in much the same way as generally intelligent biological systems do. This is not trivial, and the artificial systems we consider to be most intelligent today do not expressly learn a world model, a fact that is likely to constrain the level of intelligence they are able to achieve. Currently, it is research in autonomous vehicles where the most advanced world models are being developed.
Here we have embodied agents with an array of sensors learning world models through interaction with their environment. Huge levels of investment are being poured into this research, and yet there still seem to be significant barriers to the development of commercially viable autonomous vehicles, with many of the big players having pulled out of the race. This, I think, is testament to how hard learning world models is and provides some support to the notion that we are probably some way away from an AGI.
All right, back to real NLW here. Just a quick conclusion. Obviously, there is a ton to chew on. The point that I want to make, and the hope of this episode, is simply to serve as a reminder that these world models have bigger implications than just what they can do. There are many very smart people who think that they could be a key part of unlocking future AI capabilities, especially now with World Labs and Google's Genie coming to fruition. It's something that I expect to see a lot more of in 2025. And now you have the basis from which to engage.
Appreciate you guys listening as always. Until next time, peace.