How world models became AI's next frontier

Large language models can do a lot of things, as long as those things exist on a screen.

Having consumed practically all of the data on the internet, these models can tell you something about practically anything. They can analyze thousands of documents and pull out the most important parts. They can plan your trips, write poems, and mimic therapists in times of need. They can help researchers understand anything from protein structures to ancient history. And they’re the backbone of millions of agents that are the hottest ticket in tech right now.

But the world is bigger than a screen, and understanding it requires more than words.

That's why Nvidia’s Jensen Huang has said more than once that physical AI is due for its "ChatGPT moment." It’s also the reason that world models, or AI models capable of understanding the physical environment, have gained significant momentum in 2026. Along with a flock of young startups entering the space, some of AI’s most prominent figures have homed in on the concept.

But as the momentum around world models grows, in tandem with physical AI and robotics, some leaders have begun to question their impact on LLMs, and ultimately, the ever-elusive path to artificial general intelligence (AGI).

"Seeing the world in a profound way, in a way that you participate in your movement, your interaction, in your communication, is critical for intelligence," said Dr. Fei-Fei Li, widely regarded as the godmother of AI and the founder and CEO of World Labs, during a panel at the HumanX conference in April. "Not having that is intelligence in the dark."

Investors take notice

As world models catch the attention of some of AI’s biggest tastemakers, investors have become captivated. World model startups have been raking in billions in funding, some at incredibly early stages:

- Advanced Machine Intelligence, or AMI Labs, a startup founded by AI godfather and former Meta AI chief Yann LeCun, raised a $1 billion seed funding round at a $3.5 billion pre-money valuation in March, after launching in late 2025.
- World Labs, Li’s startup founded in 2024, announced its own $1 billion funding round in February at a valuation of $5 billion.
- Runway, a company focused on scaling video models as a means of creating world models, raised $315 million in Series E funding at a valuation of $5.3 billion.
- Luma, an AI video startup that focuses on creating multimodal reasoning models, raised a $900 million Series C in November at a valuation of $4 billion.
- General Intuition, a lab dedicated to building foundation models for environments that require "deep spatial and temporal reasoning," raised $320 million in Series A funding at a $2.3 billion valuation in June.

Beyond startups, several companies have pivoted into world models from one industry in particular: gaming.

Niantic, the maker of the beloved Pokémon Go app, sold its suite of mobile games to Scopely for $3.5 billion and spun out a lab last March called Niantic Spatial, focused on developing what it calls a "Large Geospatial Model," a world model that enhances spatial reasoning in LLMs.

And Roblox, the online gaming platform with more than 150 million daily active users, is developing its own version of a world model that it calls "real-time dreaming," that allows creators to generate and iterate on virtual environments through language prompts. In a panel at the HumanX conference in April, David Baszucki, Roblox founder and CEO, said that he envisions the company’s world models "not just as a play technology, but as a creation technology as well."

In January, Google DeepMind released Project Genie, an "experimental research prototype" that marks the latest iteration of its work in the world model space. The project is powered by its flagship Gemini model, its Nano Banana Pro image model, and Genie 3, its most powerful world model yet.

Nvidia, meanwhile, unveiled its Cosmos model at CES 2025, a world foundation model that’s aimed at accelerating the development and deployment of autonomous vehicles and robots. Since then, the company has expanded Cosmos further, including debuting the third generation of the Cosmos family and releasing world-generation models, controllable simulations for synthetic data generation, and multimodal reasoning models for physical AI.

Many bets, same problems

The challenge with the term "world model" is that it means different things to different people.

During her HumanX panel, Li said that defining world models in 2024 was much easier. "No one was talking about this," she said. "I'm keenly aware that different groups have different definitions of world models at this point."

Broadly speaking, building a world model involves creating machines with the ability to understand space, she explained, such as reasoning about geometry, interactivity and physics, as well as generating those spaces "just like our computers today generate words."

And as Li said, there are several different ideas about how to achieve this.

For starters, there are predictive video models that learn to forecast future states from video data. This is the method that Nvidia’s Cosmos models and Google’s Genie models have chosen. Anastasis Germanidis, co-founder and CTO of Runway, an AI video company, told The Deep View that it is also staking its world model on "interactive video," which is how it built GWM-1. “We've seen that the most important thing for making great world models is having great video models,” Germanidis said.

World Labs, Li’s company, founded in 2024, focuses on spatial intelligence. Rather than homing in on video generation, World Labs is focused on 3D world representation as a spatial object, not a sequence of frames. Marble, its first world model, aims to create geometrically-accurate, consistent and navigable environments that you can interact with, edit and reuse.

*An example of a simulation from World Labs' Marble software.*

Simulation models, such as robotics, gaming, or autonomous driving simulators, also fall into the world models bucket, simulating environments that agents can act within and learn from. For instance, this kind of model is the backbone of Tesla’s self-driving technology and the Waymo World Model. Robotics models, such as vision-language-action models like Figure AI’s Helix, can also fall into this category.

Multimodal models can also fall into the world models category. That includes the approach taken by Luma, an AI video startup focused on creating "multimodal general intelligence," or models that can understand the world by taking in a wide variety of data, such as images, videos, audio and text.

"Our bet on world models is that you train models jointly on all of these modalities, just the same way as our brains work, rather than like saying, '3D is the path to world models,' or 'video is the path to world models,'" Caroline Ingeborn, COO of Luma, told The Deep View. "If you're going to have models that understand and can generate and operate in our worlds, you need to be trained across modalities."

Despite the different approaches to creating these models, they all face the same two constraints that plague everyone building AI: compute and data. But because these models digest and create more than simple words, the problems are amplified.

For one, these models are far more compute intensive than their language-only predecessors. And amid an unprecedented crunch for compute, the entire industry is feeling the squeeze. That's why OpenAI completely axed its own AI video platform, Sora, to free up more compute for its flagship models. Li even quipped about this problem in her HumanX keynote, saying that her "friends in Nvidia and AMD will be very happy" following her company’s most recent funding announcement, implying that World Labs would be buying a lot more compute.

As for data, these models require more than just what can be scraped from large repositories of text. Similar to the data shortages in the robotics industry, these models need large amounts of high-quality data from real-world environments to learn to recreate them. Getting these models to behave consistently and create physically accurate data largely relies on this.

“There's never enough compute, and there's never enough data,” said Germanidis.

Today's primary use of world models

These various world models may be able to solve each other's problems. As it stands, the most commercially significant use case for these models is generating synthetic data, which is then used to train robotics or self-driving vehicles.

Deepu Talla, VP of robotics and edge AI at Nvidia, told The Deep View that these kinds of models are critical to filling the data gap plaguing physical AI development and are the primary use case for Nvidia’s Cosmos models. Though many developers are on the quest to build a "general-purpose robotic brain," Talla said that the brain itself may not be enough to create robots that can seamlessly act on the world around them.

"A robot can only do so much if you don't have the environment for the robot to interact with," said Talla. "The world model is about creating the whole environment around it to do that simulation."

Li echoed a similar sentiment for the work of World Labs, noting in her keynote that the output "contributes to further providing synthetic data for others in the ecosystem."

This application is vitally important because it can create a symbiotic relationship between physical AI systems and the world models used to build them. The better physical AI systems perform, the more they can collect real-world data to help train the models.

World models’ synthetic data also helps break down a major barrier that physical AI systems face: safety. While a self-driving data set may have a sizable pool of info on typical driving situations, "the vast, vast majority" is going to be non-accidents, Germanidis said. However, "The place where their models need to perform the best — that it's critical to perform the best — is exactly in those [dangerous] moments."

But an AI model that can comprehend the world beyond words and has an understanding of space and physics has practically limitless use cases, said Ingeborn. Physically accurate AI models can be transformative for the entertainment industry, such as filmmaking, VFX and editing, she said. These models may also be groundbreaking for the gaming industry, allowing developers to build game environments without requiring the resources of a large publisher.

And beyond entertainment, these models have potential in science as well. In her 2025 essay, “Spatial Intelligence is AI’s Next Frontier,” Li notes that an understanding of the physical world is crucial for materials discovery and climate science by integrating multi-dimensional simulation with real-world data collection. These models also have the potential to accelerate drug discovery in medicine and act as companions in healthcare settings.

"If you're doing an open heart transplant, you need to understand the physical world," Ingeborn said. "Imagine having an AI that can understand and operate alongside you."

The AGI challenge

But like most things in AI, all roads lead back to artificial general intelligence. Many in the world model space firmly believe that spatial understanding is essential to achieving the nebulous and lofty goal of human-level machine intelligence.

"World models are kind of the only path to AGI in my mind," said Germanidis. "There is no way to get to general intelligence unless you understand the real world."

While there are many different ways to define artificial general intelligence (as well as fiercely debated questions about whether achieving it is possible or if it’s a worthwhile goal to begin with), it’s also the core mission that underlies practically every major frontier AI lab.

Let’s start with Anthropic. The AI giant has largely focused on large language models as the gateway to achieving massively powerful AI. Though CEO Dario Amodei has shunned the term AGI, in his renowned essay “Machines of Loving Grace,” he has defined powerful AI as that with skill levels "exceeding that of the most capable humans in the world," which can autonomously complete tasks and perform complex science and engineering. To achieve this definition, Amodei and Anthropic have bet on the paradigm of scaling laws: More compute and more data will, someday, lead to a language model with these capabilities.

OpenAI holds a similar view. More comfortable with the term AGI itself, the company has pledged itself to the long-term mission of creating general intelligence, or "highly autonomous systems that outperform humans at most economically valuable work," that benefit "all of humanity." CEO Sam Altman is bullish on the idea that this will arrive sooner rather than later, with scaling laws being the fundamental method of getting there.

But many in the industry have started to question whether language models are the key to unlocking this higher level of intelligence. Since the dawn of the concept of world models, experts have started to poke holes in the idea that language alone will get us there. One such voice is LeCun.

On multiple occasions, LeCun has argued that language models will not lead us to human-level machine intelligence, as language offers only a limited slice of intelligence, restricted by the collections of letters and numbers humanity has written down. Without an understanding of environments, perception and interaction, these models are essentially untethered from reality, he asserts.

And the argument has its merits. Evolutionarily, humans developed the ability to communicate as a means of comprehending our other five senses. Without an understanding of those senses, or an understanding of physical consequence, how can a machine understand what it is to be human?

"With this whole idea of AGI and a more general intelligence, it has become more clear that you cannot just get that from text," Amit Goel, Nvidia's head of robotics and edge computing, told The Deep View. "For that level of intelligence, you need to consume information from the world. That is how you get to the next level."

Where we go next

In pursuit of the ever-powerful, superhuman language model, OpenAI and Anthropic alone have garnered a total of nearly $330 billion in funding and counting, each with a valuation nearly scraping the trillion-dollar mark.

But the question remains: With the growing momentum toward world models and spatial intelligence, what is left for LLMs?

Germanidis said there is increasingly a recognition in the industry that language model developers are "saturating the data we have available." As a result of approaching that limit, he believes that language model improvement will soon, too, hit a saturation point, making these models unable to solve "core problems that we care about."

While all of the text available in human history would take humans thousands of years to read through, that data is still just a small fraction of the real-world data that humans perceive with our senses over time. “There is this broad set of data that's about the real world that actually we haven't trained on, and we haven't built systems around,” said Germanidis. “It's the biggest next opportunity for AI.”

However, what these models are capable of and how we actually use them are two different things. The issue of capability overhang, or the gap between the capabilities AI can offer and the extent to which we actually use them, still looms among businesses and consumers alike. And as it stands, a relatively small fraction of people actually use AI on a day-to-day basis: just over 16%, according to a Microsoft study.

The fact that such a small part of the population is leveraging language models to their fullest extent might be proof that, even if language models run out of words to eat and reach a capability plateau, there may be a long, long way to go before we run out of ways to leverage them, said Ingeborn.

Luis Lastras, director of language technologies at IBM, told The Deep View that enterprises are still figuring out how to leverage language models in reliable, trustworthy and efficient ways.

While physical AI and world models present a lot of promising applications, "when it comes to productive use of enterprises, I would say there's a lot that we can build on the current platform that is not really fully fleshed out, especially when it comes down to efficient automation," Lastras said.

Additionally, Nvidia’s Goel said these models may work hand-in-hand. Rather than being pitted against world models, language models can serve as the communication layer for world models and spatial intelligence. "I see it as a continuum," said Goel. "This new wave is building on top of LLMs."

The reality is that physically accurate world models and spatial intelligence have a long, long road ahead before they actually realize the lofty goals of Li, LeCun, and others. Along with confronting the same barriers of data and compute that frontier language model labs are facing, these models are simply more difficult to build than those that are language-only.

It’s hard to create a model that holds an internal representation of our world, one that consistently understands and applies physics in its generation. It is just as difficult to create a model that transcends the smartest humans in the world in every conceivable domain, which is the goal of AGI, no matter how close AI figureheads claim we are.

By the time highly capable world models are ready to become a critical component of the technology ecosystem, language models may have already run their natural course. And the grand AGI question will have to sort itself out along the way.

source & further reading

thedeepview.com (original article)