Large language models (LLMs) have demonstrated remarkable abilities in processing and generating human-like text, but their limitations in understanding the physical world are becoming increasingly apparent. As AI ventures beyond digital spaces and into robotics, autonomous driving, and manufacturing, a new approach is gaining traction: the development of “world models.” These models aim to provide AI systems with an internal simulation of reality, enabling them to predict outcomes and interact with their environment more effectively. Recent investment signals a major shift in this direction, with AMI Labs securing $1.03 billion in seed funding shortly after World Labs raised $1 billion, according to reports from VentureBeat and TechCrunch.
The core challenge lies in the fact that LLMs excel at predicting the next word in a sequence but lack a fundamental understanding of physical causality. As Turing Award recipient Richard Sutton warned in an interview with podcaster Dwarkesh Patel, these models often “mimic what people say instead of modeling the world,” hindering their ability to learn from experience and adapt to changing conditions. This limitation results in “brittle behavior,” where even minor changes to input can cause significant errors, as noted by Google DeepMind CEO Demis Hassabis, who described current AI as exhibiting “jagged intelligence” – capable of complex tasks like solving math problems but failing at basic physics.
Three Approaches to Building World Models
To overcome these limitations, researchers are focusing on building world models that act as internal simulators. However, the field isn’t unified; several distinct architectural approaches are emerging, each with its own strengths and weaknesses. These can be broadly categorized into three main strategies.
JEPA: Real-Time Efficiency Through Latent Representations
The first approach, championed by AMI Labs, centers on learning latent representations rather than attempting to predict the dynamics of the world at the pixel level. This method is based on the Joint Embedding Predictive Architecture (JEPA). JEPA models aim to replicate how humans perceive the world, focusing on essential features and discarding irrelevant details. For example, when observing a car, we track its trajectory and speed, not the precise reflection of light on every leaf. As trendingtopics.eu highlights, this selectivity pays off in efficiency: JEPA requires fewer training examples and operates with lower latency.
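To make the distinction concrete, here is a minimal sketch of a JEPA-style training step, assuming a PyTorch setup; the network shapes and module names are illustrative, not AMI Labs' actual architecture. The defining move is that the predictor's target is the latent embedding of the future observation, never its pixels:

```python
import torch
import torch.nn as nn

# Illustrative JEPA-style training step. All dimensions are assumptions;
# the key idea is that the loss lives in latent space, not pixel space.

latent_dim = 256
encoder = nn.Sequential(        # maps frames to latent vectors
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 512),
    nn.ReLU(),
    nn.Linear(512, latent_dim),
)
predictor = nn.Sequential(      # predicts the target latent from the context latent
    nn.Linear(latent_dim, latent_dim),
    nn.ReLU(),
    nn.Linear(latent_dim, latent_dim),
)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

def jepa_step(context_frames, target_frames):
    z_context = encoder(context_frames)      # what the model currently sees
    with torch.no_grad():                    # stop-gradient on the target branch
        z_target = encoder(target_frames)    # what actually comes next
    z_pred = predictor(z_context)            # prediction made entirely in latent space
    loss = nn.functional.mse_loss(z_pred, z_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Published JEPA variants such as I-JEPA also use a momentum-updated target encoder and masking strategies to keep the latent space from collapsing; the sketch above omits those details for brevity.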
This efficiency makes JEPA well-suited for applications demanding real-time inference, such as robotics, self-driving cars, and high-stakes enterprise workflows. AMI Labs is already partnering with healthcare company Nabla to leverage this architecture for simulating operational complexity and reducing cognitive load in fast-paced medical settings. Yann LeCun, co-founder of AMI, explained in an interview with Newsweek that JEPA-based world models are designed to be “controllable,” achieving goals by construction.
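That "controllable by construction" property comes from planning: with a learned predictor, a system can search over candidate actions and pick whichever sequence the model expects to land closest to a goal state. Below is a hedged sketch of the simplest such planner, random-shooting model-predictive control; the `world_model` callable is an assumed interface, not an AMI Labs API:

```python
import torch

def plan(world_model, z_start, z_goal, horizon=5, candidates=64, action_dim=4):
    """Random-shooting model-predictive control in latent space.
    `world_model(latent, action) -> next latent` is an assumed interface.
    Returns the first action of the best-scoring candidate sequence."""
    actions = torch.randn(candidates, horizon, action_dim)  # sample candidate action sequences
    z = z_start.expand(candidates, -1)                      # roll all candidates in parallel
    for t in range(horizon):
        z = world_model(z, actions[:, t])                   # imagined rollout, no real-world trial
    costs = torch.norm(z - z_goal, dim=-1)                  # distance to the goal latent
    best = torch.argmin(costs)
    return actions[best, 0]                                 # execute one step, then replan
```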
Gaussian Splats: Building 3D Spatial Environments
A second approach utilizes generative models to construct complete spatial environments from scratch. Companies like World Labs employ this method, taking an initial prompt – an image or text description – and generating a 3D scene using Gaussian splats. These splats represent 3D geometry and lighting with millions of tiny particles, allowing for direct import into physics and 3D engines like Unreal Engine for interactive exploration. This drastically reduces the time and cost associated with creating complex 3D environments. World Labs founder Fei-Fei Li has pointed out that LLMs are often “wordsmiths in the dark,” lacking the spatial intelligence and physical experience that Gaussian splats provide. Their Marble model aims to bridge this gap.
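For readers unfamiliar with the representation, here is a simplified sketch of what a single splat stores and how a renderer blends splats along a viewing ray. The field names are illustrative rather than World Labs' Marble format, but the front-to-back alpha compositing is the standard formulation used in Gaussian splatting renderers:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    """One of the millions of particles making up a splatted scene."""
    position: np.ndarray   # 3D center (x, y, z)
    scale: np.ndarray      # per-axis extent of the Gaussian ellipsoid
    rotation: np.ndarray   # orientation quaternion (w, x, y, z)
    color: np.ndarray      # RGB; real renderers often store spherical harmonics instead
    opacity: float         # base alpha in [0, 1]

def composite_pixel(splats_near_to_far, falloffs):
    """Front-to-back alpha compositing: each splat contributes its color,
    attenuated by the transmittance left over from the splats in front.
    `falloffs` holds each splat's 2D Gaussian weight at this pixel."""
    color = np.zeros(3)
    transmittance = 1.0
    for splat, falloff in zip(splats_near_to_far, falloffs):
        alpha = splat.opacity * falloff
        color += transmittance * alpha * splat.color
        transmittance *= 1.0 - alpha
    return color
```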
While not designed for split-second execution, this approach holds significant potential for spatial computing, interactive entertainment, industrial design, and robotics training. Autodesk’s investment in World Labs underscores the enterprise value of integrating these models into industrial design applications.
End-to-End Generation: Scaling Synthetic Data
The third approach involves end-to-end generative models that continuously generate scenes, physical dynamics, and reactions in real-time. Models like DeepMind’s Genie 3 and Nvidia’s Cosmos fall into this category, offering a simplified interface for creating interactive experiences and vast amounts of synthetic data. DeepMind demonstrated Genie 3’s capabilities by showcasing its ability to maintain object permanence and consistent physics at 24 frames per second without a separate memory module. Nvidia Cosmos leverages this architecture to scale synthetic data for physical AI reasoning, enabling the creation of rare and dangerous scenarios for autonomous vehicle and robotics development, as Waymo has done with its self-driving car training.
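The interaction pattern is what distinguishes this family: the generative model itself is the simulator, producing each frame autoregressively from the frame history and the user's latest action. The sketch below illustrates that loop with a hypothetical `model.next_frame` interface; neither Genie 3 nor Cosmos exposes this exact API:

```python
import time

def run_interactive_world(model, first_frame, get_user_action, fps=24):
    """Autoregressive rollout: one generated frame per user action.
    `model.next_frame(history, action)` is an assumed interface; the
    growing `history` list is the model's only memory of the scene."""
    history = [first_frame]
    frame_budget = 1.0 / fps                 # roughly 41.7 ms per frame at 24 fps
    while True:
        start = time.monotonic()
        action = get_user_action()           # e.g. "move forward", "turn left"
        frame = model.next_frame(history, action)
        history.append(frame)                # keeps objects and physics consistent
        yield frame
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, frame_budget - elapsed))
```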
However, this end-to-end method demands significant computational resources to render physics and pixels simultaneously. Despite the cost, it’s considered necessary to achieve the deep understanding of physical causality that Hassabis believes is crucial for safe and reliable AI operation in the real world.
The Future of AI: Hybrid Architectures
As these models mature, hybrid architectures that combine their strengths are emerging: LLMs will likely continue to serve as the primary interface for reasoning and communication, while world models establish themselves as the foundational infrastructure for physical and spatial data pipelines. For example, cybersecurity startup DeepTempo has developed LogLM, integrating elements of LLMs and JEPA to detect anomalies and cyber threats from security logs.
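DeepTempo has not published LogLM's internals, but the general hybrid pattern can be sketched: embed each log line, predict the next line's embedding JEPA-style, and flag lines whose actual embeddings deviate sharply from what the model anticipated. Everything in the following example, including the `predictor` component, is a hypothetical illustration of that pattern:

```python
import numpy as np

def flag_anomalous_lines(log_embeddings, predictor, z_threshold=3.0):
    """Hypothetical LogLM-style detector: score each log line by how far
    its embedding lands from what a JEPA-style sequence `predictor`
    anticipated. `predictor(history) -> predicted next embedding` is assumed."""
    errors = []
    for i in range(1, len(log_embeddings)):
        predicted = predictor(log_embeddings[:i])
        errors.append(np.linalg.norm(log_embeddings[i] - predicted))
    errors = np.array(errors)
    z_scores = (errors - errors.mean()) / (errors.std() + 1e-8)
    # Offset by one: errors[0] scores the second log line.
    return [i + 1 for i, z in enumerate(z_scores) if z > z_threshold]
```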
The development of robust world models represents a critical step towards creating AI systems that can truly understand and interact with the physical world. Further research and development in this area will undoubtedly shape the future of AI across a wide range of industries.