The world’s top artificial intelligence groups are stepping up their focus on so-called world models that can better understand human environments, in the search for new ways to achieve machine “superintelligence”.
Google DeepMind, Meta and Nvidia are among the companies attempting to gain ground in the AI race by developing systems that aim to navigate the physical world by learning from videos and robotic data rather than just language.
This push comes as questions grow over whether large language models — the technology that powers popular chatbots such as OpenAI’s ChatGPT — are hitting a ceiling in their progress.
Performance gains between successive LLM releases from companies across the sector, such as OpenAI, Google and Elon Musk’s xAI, have been shrinking, despite the vast sums invested in their development.
The potential market for world models could be huge, almost the size of the global economy, according to Rev Lebaredian, vice-president of Omniverse and simulation technology at Nvidia, because the technology would extend AI into physical industries such as manufacturing and healthcare.
“What is the opportunity for world foundation models? Essentially . . . $100tn if we can make an intelligence that can understand the physical world and operate in the physical world,” he said.
World models are trained using data streams of real or simulated environments. They are viewed as an important step in pushing forward progress in self-driving cars, robotics and so-called AI agents, but require a huge amount of data and computing power to train and are considered an unsolved technical challenge.
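As a rough illustration of what such a data stream looks like in practice, the sketch below collects (observation, action, next observation) transitions from an open-source simulator; the chosen environment, random policy and buffer are illustrative assumptions, not any lab’s actual training pipeline.

```python
# Illustrative only: gathers simulated transitions of the kind a world
# model is trained on. "CartPole-v1", the random policy and the buffer
# are stand-ins, not any company's real setup.
import gymnasium as gym

env = gym.make("CartPole-v1")
buffer = []                                    # (obs, action, next_obs) stream
obs, _ = env.reset(seed=0)
for _ in range(1_000):
    action = env.action_space.sample()         # random exploration policy
    next_obs, _, terminated, truncated, _ = env.step(action)
    buffer.append((obs, action, next_obs))     # the raw material of a world model
    obs = next_obs
    if terminated or truncated:
        obs, _ = env.reset()
env.close()
# A world model would then learn to predict next_obs from (obs, action).
```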
The focus on this alternative to LLMs has become more visible in recent months, as several AI groups have unveiled a series of advances in world models.
Last month, Google DeepMind released Genie 3, which generates video frame by frame and takes past interactions into account. Video generation models have previously created the entire clip in one pass, rather than step by step.
“AI . . . remains very much limited to the digital domain,” said Shlomi Fruchter, co-lead of Genie 3 at Google DeepMind. “By building environments that look like or behave like the real world, we can have much more scalable ways to train the AI . . . without the real implications of making a mistake in the real world.”
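The frame-by-frame, interaction-conditioned generation that Genie 3 is described as performing can be sketched schematically; the tiny recurrent model, embedding sizes and one-hot action encoding below are illustrative assumptions, not Google DeepMind’s architecture.

```python
# Schematic sketch of autoregressive, action-conditioned frame
# generation. TinyDynamicsModel, FRAME_DIM and ACTION_DIM are
# illustrative assumptions, not Genie 3's actual design.
import torch
import torch.nn as nn

FRAME_DIM, ACTION_DIM = 64, 8                  # assumed embedding sizes

class TinyDynamicsModel(nn.Module):
    """Predicts the next frame embedding from the (frame, action) history."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FRAME_DIM + ACTION_DIM, 128, batch_first=True)
        self.head = nn.Linear(128, FRAME_DIM)

    def forward(self, frames, actions):
        # frames: (batch, time, FRAME_DIM); actions: (batch, time, ACTION_DIM)
        x = torch.cat([frames, actions], dim=-1)
        hidden, _ = self.rnn(x)                # summarise the interaction history
        return self.head(hidden[:, -1])        # one new frame, not a whole clip

model = TinyDynamicsModel()
frames = torch.randn(1, 1, FRAME_DIM)          # seed frame
actions = torch.zeros(1, 1, ACTION_DIM)        # no player input yet
for step in range(16):                         # generate frame by frame
    next_frame = model(frames, actions)        # conditioned on all past frames
    frames = torch.cat([frames, next_frame.unsqueeze(1)], dim=1)
    user_input = torch.zeros(1, 1, ACTION_DIM)
    user_input[0, 0, step % ACTION_DIM] = 1.0  # stand-in for a player's action
    actions = torch.cat([actions, user_input], dim=1)
```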
Meta is attempting to replicate how children learn passively by observing the world around them, training its V-JEPA models on raw video content.
Its Facebook Artificial Intelligence Research (Fair) lab, led by Meta chief AI scientist Yann LeCun and focused on longer-term AI projects, released its second version of the model in June, which it has been testing on robots.
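The idea behind that passive, observation-driven training can likewise be sketched: hide part of a clip and predict the representations, not the pixels, of what is hidden. The encoder sizes, fixed mask and pooled predictor below are assumptions for illustration, not Meta’s released V-JEPA code.

```python
# Schematic sketch of JEPA-style latent prediction on video patches.
# All sizes, the fixed mask and the pooled predictor are assumptions.
import torch
import torch.nn as nn

PATCH_DIM, EMBED_DIM, N_PATCHES = 192, 128, 32

encoder = nn.Linear(PATCH_DIM, EMBED_DIM)         # context encoder (trained)
target_encoder = nn.Linear(PATCH_DIM, EMBED_DIM)  # target encoder (held fixed;
                                                  # in practice an EMA copy)
predictor = nn.Linear(EMBED_DIM, EMBED_DIM)       # guesses the hidden latents

clip = torch.randn(1, N_PATCHES, PATCH_DIM)       # flattened video patches
mask = torch.zeros(N_PATCHES, dtype=torch.bool)
mask[::2] = True                                  # hide half the patches

with torch.no_grad():
    targets = target_encoder(clip)[0, mask]       # latents to be predicted

context = encoder(clip)[0, ~mask]                 # encode only what is visible
predicted = predictor(context.mean(0, keepdim=True))

# The loss compares latents, not pixels: nothing is reconstructed.
loss = nn.functional.mse_loss(predicted.expand_as(targets), targets)
loss.backward()
```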
LeCun, considered one of the “godfathers” of modern AI, has been one of the most vocal proponents of the new architecture, warning that LLMs would never achieve the ability to reason and plan like humans.
Despite this, Meta’s chief, Mark Zuckerberg, has recently increased investment in top AI talent, with an elite team now pushing to make breakthroughs on its next Llama LLM models. This has included hiring Alexandr Wang, the founder of data labelling group Scale AI, to head all of Meta’s AI work, with LeCun now reporting to Wang.
One near-term application of world models is in the entertainment industry, where they can create interactive and realistic scenes. World Labs, a start-up founded by AI pioneer Fei-Fei Li, is developing a model that generates video game-like 3D environments from a single image.
Runway, a video generation start-up that has deals with Hollywood studios, including Lionsgate, launched a product last month that uses world models to create gaming settings, with personalised stories and characters generated in real time.
“Traditional video methods [are a] brute-force approach to pixel generation, where you’re trying to squeeze motion in a couple of frames to create the illusion of movement, but the model actually doesn’t really know or reason about what’s going on in that scene,” said Cristóbal Valenzuela, chief executive officer at Runway.
Earlier video-generation models produced physics unlike those of the real world, he added, a shortcoming that general-purpose world models help to address.
To build these models, companies need to collect a huge amount of physical data about the world.
San Francisco-based Niantic has mapped 10mn locations, gathering information through games including Pokémon Go, which has 30mn monthly players interacting with a global map.
Niantic ran Pokémon Go for nine years and, even after the game was sold to US-based Scopely in June, its players still contribute anonymised data through scans of public landmarks to help build its world model.
“We have a running start at the problem,” said John Hanke, chief executive of Niantic Spatial, as the company is now called following the Scopely deal.
Both Niantic and Nvidia are working to fill gaps in their data by having their world models generate or predict environments. Nvidia’s Omniverse platform creates and runs such simulations, assisting the $4.3tn tech giant’s push towards robotics and building on its long history of simulating real-world environments in video games.
Nvidia chief executive Jensen Huang has asserted that the next major growth phase for the company will come with “physical AI”, with the new models revolutionising the field of robotics.
Some, such as Meta’s LeCun, have said this vision of a new generation of AI systems powering machines with human-level intelligence could take 10 years to achieve.
But the potential scope of the cutting-edge technology is extensive, according to AI experts. World models “open up the opportunity to service all of these other industries and amplify the same thing that computers did for knowledge work”, said Nvidia’s Lebaredian.
Additional reporting by Melissa Heikkilä in London and Michael Acton in San Francisco