Fei-Fei Li proposes a clearer framework for world models

Fei-Fei Li has published a new essay arguing that the term "world model" has become too broad to be useful without a clearer breakdown of what different systems actually do. In the piece, the Stanford professor and World Labs founder says the field needs a functional taxonomy that separates world models into three main roles: renderers, simulators and planners.

The essay comes as AI labs across computer vision, robotics, reinforcement learning and generative media increasingly describe their systems as world models. Li argues that the label now covers tools with very different purposes, from video generators that create convincing imagery to physics engines that support engineering and robot training. Her central point is that these systems are all linked to the same underlying loop of action, state and observation, but each one produces a different part of it as its output.

Three roles, one loop

Li roots the discussion in partially observable Markov decision processes, or POMDPs, a classic framework from reinforcement learning. In that setup, an agent acts on the world, the world changes, and the agent receives partial observations before deciding what to do next. She says world models should be understood in relation to that loop.

A renderer, in her framing, produces observations, usually in the form of pixels. Its priority is visual realism. A simulator, by contrast, generates structured state that is geometrically and physically faithful enough for humans and software systems to compute on. A planner outputs actions, deciding what an agent should do given a goal and a view of the world.
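
To make the division of labor concrete, the three roles can be sketched as three functions around one loop. The following Python is a minimal illustration only, not code from Li's essay or from any World Labs product: the one-dimensional toy world and the names State, simulate, render, plan and run_loop are all hypothetical, chosen here to show which part of the loop each role outputs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class State:
    """Hypothetical structured state of a toy 1D world."""
    position: int
    velocity: int


def simulate(state: State, action: int) -> State:
    """Simulator role: (state, action) -> next structured state."""
    velocity = action  # assumption: the action directly sets velocity (-1, 0 or 1)
    return State(position=state.position + velocity, velocity=velocity)


def render(state: State, width: int = 10) -> str:
    """Renderer role: state -> observation (a text 'image' standing in for pixels).

    The observation is partial: velocity is not visible in the output.
    """
    cells = ["."] * width
    if 0 <= state.position < width:
        cells[state.position] = "A"
    return "".join(cells)


def plan(observation: str, goal: int) -> int:
    """Planner role: (observation, goal) -> action."""
    position = observation.find("A")
    if position < 0 or position == goal:
        return 0  # agent not visible, or already at the goal: do nothing
    return 1 if position < goal else -1


def run_loop(start: State, goal: int, max_steps: int = 20) -> List[str]:
    """Run the act -> state change -> observe loop and collect the observations."""
    state = start
    frames = [render(state)]
    for _ in range(max_steps):
        action = plan(frames[-1], goal)
        if action == 0:
            break
        state = simulate(state, action)
        frames.append(render(state))
    return frames


if __name__ == "__main__":
    for frame in run_loop(State(position=2, velocity=0), goal=5):
        print(frame)
```

In this sketch the planner sees only what the renderer emits, never the full state, which mirrors the partial observability of a POMDP. Real systems replace each function with a learned model, but the inputs and outputs stay in the same places.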

Li says these categories help explain the current state of AI development. Video generation systems can make scenes that look plausible without preserving true physical behavior. Simulation tools are better suited for architecture, robotics and autonomous systems, where correctness matters more than appearance. Planning systems, including vision-language-action models and newer world action models, focus on turning perception into decisions.

Why simulation matters most

Although renderers are the most commercially mature, Li argues that simulation is the most consequential layer. She says simulation acts as the bridge between visual generation and action planning because it captures geometry, physics and dynamics. A model that understands a cup well enough to render it from any angle should, in principle, also be able to simulate what happens when the cup is pushed and help plan how a robot might pick it up.

That potential makes simulation central to industries such as robotics, digital twins, autonomous vehicles, engineering and drug discovery. At the same time, Li notes that the field faces major bottlenecks. High-quality 3D data is much scarcer than internet video, the gap between simulated and real-world behavior remains difficult to close, and multi-physics environments are expensive to build and run.

Li points to World Labs' Marble as an early effort in this direction. The system takes prompts in text, images, video or spatial sketches and generates explorable 3D environments, including both visual representations and collision meshes that can be used in physics engines.

Toward unified world models

The broader trend, according to Li, is convergence. Renderers are becoming more interactive, simulators are becoming more editable and planners are becoming more deliberate. She suggests that the most important systems will eventually combine all three functions in a single foundation model that can produce photorealistic views, physically accurate structure and action sequences depending on the task.

Li says that goal remains difficult because the data and optimization needs of rendering, simulation and planning do not line up neatly. Still, she argues that the field is moving toward a unified understanding of world models, one that could eventually help AI systems see, build and act in the world with greater reliability.