Dhruv Tyagi

World Models: Why is it the missing link towards generalised autonomy?

AV 1.0

Self-driving cars have seemed perpetually one year away since the AV boom of 2010-2019, which was built on a robotics-first approach often dubbed AV 1.0. That approach attracted huge investment and attention around the world, but its shortcomings soon became apparent.

AV 1.0, the robotics paradigm, with its classic modular architecture built around tasks like object detection and localisation, falls short at identifying things that even a child would not miss. Persistent long-tail problems, edge cases, accidents, and various embarrassing incidents show that many of the big claims made for AV 1.0 were overblown and far from being achieved.

This posed a billion-dollar question: were we really on the correct path to autonomy?

The AI systems we see today still lack common sense, reasoning behind their decisions, and an understanding of the basic nature of their environment and of why the world is the way it is. They have no intrinsic understanding of the world or of the causality of their actions, and they are only as good as the data they were trained on. It was therefore predictable that they would fail in unforeseen scenarios, and driving in the real world serves up some of the most unforeseeable scenarios of all.

Even though the latest breakthroughs such as LLMs and Midjourney have turned the world upside down, they still lack basic reasoning and a logical understanding of their world, visible in the form of hallucinations (or, as some say, 'dreaming'). Without going into mathematical jargon, it is important to understand that these are all pattern-matching systems that try to imitate the data distributions they are trained on; when anything unpredictable, or absent from that data, comes up, they fail. It is easy for a human to be taken in by their fluency, but cases like Midjourney's finger problem point to a lack of inherent understanding of what they generate.

Consequently, to minimise errors like the finger problem, these models are inundated with vast amounts of data until they learn how a certain distribution of pixels comes closer to what a hand looks like, eventually evolving into impressive imitators.

However, a discernible intelligence gap remains, because this is rote memorisation. Having knowledge does not make a system intelligent.

While the industry is going gaga over generative AI, researchers can now also see the gap created by this lack of "common sense". This gave rise to a new paradigm in foundational AI going beyond language and vision: World Models, as a path to AGI. For safe decision-making in autonomous driving, robotics, and similar fields, models must possess a fundamental understanding of their environment and of their actions.

Thus, with recent advances in AI research around the concept of World Models, there is a paradigm shift from AV 1.0 to AV 2.0: a transition from a robotics approach to an AI-first approach, rethinking autonomous driving from a foundational AI perspective.

What is a world model?

A world model is a foundational model that builds an abstract internal representation of its environment and uses it to simulate future events within that environment. It learns from observations while interacting with the environment and other agents, and aligns its actions accordingly. It understands the inherent characteristics of its environment, including spatio-temporal structure, physics, behaviour, and interactions.

In layman’s terms, creating a world model for AI is giving it some degree of common sense.

These world models share a key property: they can anticipate future events to help drive our actions, maximising the agent's probability of making the right decision while handling uncertainty.
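The loop described above can be sketched in code: encode an observation into an abstract internal state, roll that state forward under each candidate action, and pick the action whose predicted outcome is best. The class, its method names, and the trivial one-dimensional dynamics below are all illustrative assumptions, not any particular published architecture.

```python
class ToyWorldModel:
    """A toy 1-D world: the state is a position, actions shift it."""

    def encode(self, observation):
        # Build the abstract internal representation (here, trivially,
        # the observed position itself).
        return float(observation)

    def predict(self, state, action):
        # Simulate the future: learned dynamics would go here; we assume
        # simple additive motion for illustration.
        return state + action

    def plan(self, observation, actions, goal):
        # Imagine each candidate action's consequence and choose the one
        # whose predicted future state lands closest to the goal.
        state = self.encode(observation)
        return min(actions, key=lambda a: abs(self.predict(state, a) - goal))


model = ToyWorldModel()
best = model.plan(observation=0.0, actions=[-1.0, 0.0, 1.0], goal=0.8)
```

The key point of the sketch is that the decision is made by simulating consequences inside the internal model, not by pattern-matching the observation against training data.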

From today's AI researchers to some of the most influential figures in the field, such as Yann LeCun and Yoshua Bengio, many view the world model as the key to evolving AI the way biological intelligence evolved: first achieving cat-level intelligence, and then moving towards the higher levels of human intelligence.

This raises a new question of how World Models should be architected: discriminative, generative, or Bayesian? Multiple proposals have explored this.

While Yann LeCun's proposal of Objective-Driven AI, David Ha's approach as an offspring of reinforcement learning, and Karl Friston's exploration of intelligent agents through Active Inference all draw heavily on biological intelligence, there is also a strong generative school of thought, represented by Wayve's GAIA and OpenAI's SORA.

One could argue that recent generative breakthroughs can produce realistic short videos, and so may have developed some understanding of physics and motion. However, they remain very limited, struggling with complex camera or object motion, among other things. The aim of general world models is to represent and simulate a wide range of situations and interactions like those encountered in the real world.

"OpenAI's world simulator SORA is a dead end!" said Yann LeCun. According to him, "Modelling the world for action by generating pixels is as wasteful and doomed to failure." While this stance may seem bold, it resonates with the fact that a task requiring rational decision-making, like driving, needs more than pixel generation.

What is Nature-inspired AI?

"The missing link to generalised autonomy is biological intelligence," says Gagandeep Reehal, co-founder of Minus Zero. Around this thesis, we introduced Nature-inspired AI: a bio-inspired framework for building autonomous agents by inducing decision-making capabilities for generalised locomotion, the most primitive capability of any intelligent species. Nature-inspired AI is our approach to building World Models for generalised locomotion.


Comprising three core components (Representation, Prediction, and Adaptation), this framework enables agents to develop a unified understanding of how the world behaves: capturing not just the dynamics of the world but also the dynamics of the different agents, their interactions, and the outcomes of the ego agent's actions, in order to produce an interpretable decision and then update that understanding with observations of its outcomes.


We humans build a world model by observing the world, developing intuitive understanding and reasoning that tell us what is likely, what is plausible, and what is impossible. NIA's representation is a cognitive model of how the world works, capturing causality and belief systems to understand the environment and other agents' intents and behaviours, and to make explainable decisions.


Similar to our brain's ability to simulate inference, behaviour, and interactions between the environment and other agents, prediction allows the model to estimate missing information about the world not provided by perception, to predict plausible future states of the world, and to optimise its own intent towards the desired outcomes.


Once the agent predicts how it would like the world to be and aligns its intent accordingly, it optimises variables adapted to the specific domain it operates in, such as driving a car or playing a game, so that the actor can produce the most suitable action. It then uses feedback from the actor and its senses to learn continuously on the go, minimising discrepancies between its intents or beliefs and environmental evidence in order to evolve the world model.

Together, these components form a bio-inspired world model that can be combined with action and sensing to perform tasks like driving. It can be further specialised to any specific domain by incentivising the relevant actions.
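The three-component cycle described above can be illustrated as a simple loop: representation folds new observations into a belief about the world, prediction rolls that belief forward under an intended action, and adaptation shrinks the discrepancy between what was expected and what actually happened. The scalar "belief", the blending weights, and the update rate below are deliberately toy assumptions, not Minus Zero's actual architecture.

```python
def represent(observation, belief):
    # Representation: fold the latest observation into the current
    # belief about the world (a simple 50/50 blend here).
    return 0.5 * belief + 0.5 * observation


def predict(belief, action):
    # Prediction: anticipate the next world state given the intended action.
    return belief + action


def adapt(belief, predicted, observed, rate=0.3):
    # Adaptation: nudge the belief to reduce the gap between what the
    # model expected and the environmental evidence it received.
    return belief + rate * (observed - predicted)


belief = 0.0
for action, observed_next in [(0.1, 0.2), (0.1, 0.4), (0.1, 0.5)]:
    predicted = predict(belief, action)                # prediction
    belief = adapt(belief, predicted, observed_next)   # adaptation
    belief = represent(observed_next, belief)          # representation
```

Each pass through the loop leaves the belief a little closer to what the world actually did, which is the "continuously learn on the go" behaviour described above.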

How do World Models solve Generalised Autonomy?

Since predicting future events is a fundamental and critical capability for autonomous systems, world models can be leveraged to make informed, outcome-conscious decisions while driving. This gives autonomous vehicles a sixth sense (or, let's say, common sense): the ability not just to see the world, but to understand it and anticipate the future.

Thus, for autonomous driving, world models would likely surpass traditional AI models for the reasons discussed below.

Generalisation to handle tricky or unseen scenarios:

World models understand the true nature of the environment, including its physics, and can predict the consequences of their actions. They are therefore not confused by unseen scenarios, such as a new variation of an obstacle, and can make the safest decision.

Inherent Explainability:

Unlike black-box models, world models offer a degree of transparency owing to their causal representation of world dynamics. This makes it easier to identify potential biases and ultimately builds higher trust in their decision-making.

Reduces data dependency for Scalability:

Having 'common sense' allows models to generalise well from limited data, as humans do, by simulating diverse variations of the environment before tackling the complexities of the real world. This makes them less prone to data-distribution bias and reduces the need for labelled samples, allowing them to scale quickly across geographies, scenarios, and so on. Moreover, there is no need to continually increase model size to achieve better performance.

Enhanced Safety:

Generalisation allows world models to develop a richer understanding of the environment, which, combined with reasoning capabilities and causality-driven actions, helps them make accurate, precise, informed, and safer driving decisions. Further, the explainability of their actions helps build trust in using these models for critical tasks like driving.


The Road Ahead

The emergence of world models in the context of autonomous vehicles opens up an exciting future. Nature-inspired AI proposes a plausible path towards world models inspired by biological intelligence. While the methods to achieve this may vary, it calls for a strong shift in AI research from supervised methodologies towards learning through observation and exploration. To perfect these components of representation, prediction, and adaptation, we at Minus Zero are focusing significant research effort on spatio-temporal scene understanding, model interpretability, physics-guided networks, context awareness, and several other domains.

