The Allen Institute for AI has released a new model called MolmoMotion that, given a video frame, a set of query points and a written action description, predicts how objects will move in 3D space. The system is designed to forecast motion rather than merely observe it, with possible uses in robotics planning and controllable video generation.

MolmoMotion takes as input an RGB observation, points marked on an object and a natural-language instruction, such as a request to move or rotate an item. It then estimates the future 3D trajectories of those points over the next few seconds. According to the researchers, the model outperforms existing motion forecasting approaches across a range of scenes and tasks.

The release includes more than just the model itself. Allen AI is also publishing MolmoMotion-1M, which it says is the largest collection of action-described 3D point trajectories assembled so far. The dataset was built from 1.16 million videos and includes trajectories spanning 736 motion types and 5,600 distinct objects. The team is also releasing PointMotionBench, a human-validated evaluation set with 2,700 clips intended to measure object-centric 3D motion forecasting.

How the system works

MolmoMotion uses a representation based on sparse 3D points attached to objects in world space. The researchers say this format is meant to be class-agnostic, stable across viewpoints and useful for downstream systems that need to reason about physical motion. Because the predicted paths are explicit 3D trajectories, they can be passed to other systems rather than reconstructed from pixels.

The model is built on the Molmo 2 backbone, which helps connect language instructions with objects and points in an image. The system looks at a short video history, the action description and the starting coordinates of the query points, then forecasts where those points are likely to move.
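
To make the inputs and outputs concrete, here is a minimal Python sketch of how such a forecasting request and its result could be laid out. The class names, field names and the check_forecast helper are hypothetical and chosen for illustration; they are not taken from the MolmoMotion release.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class ForecastRequest:
    """Hypothetical container for the three inputs the article describes."""
    frames: List[str]            # stand-ins for the short RGB video history
    query_points: List[Point3D]  # starting 3D coordinates of points on the object
    action: str                  # natural-language action description


@dataclass
class Forecast:
    """Hypothetical output: one list of 3D points per future time step."""
    steps: List[List[Point3D]]


def check_forecast(request: ForecastRequest, forecast: Forecast) -> bool:
    """Return True if every future step has one 3D position per query point."""
    n = len(request.query_points)
    return all(
        len(step) == n and all(len(p) == 3 for p in step)
        for step in forecast.steps
    )


request = ForecastRequest(
    frames=["frame_000.png", "frame_001.png"],
    query_points=[(0.10, 0.00, 0.50), (0.12, 0.00, 0.50)],
    action="slide the bowl to the right",
)
# A stand-in forecast that simply holds every point still for three steps.
forecast = Forecast(steps=[list(request.query_points) for _ in range(3)])
print(check_forecast(request, forecast))
```

Because the output is a plain list of 3D coordinates per time step, a downstream planner or video generator could consume it directly, which is the property the researchers highlight.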

Allen AI trained two versions of the model. One, called MolmoMotion-AR, generates future coordinates step by step in a text-like format. The other, MolmoMotion-FM, predicts trajectories in continuous space and is meant to handle uncertainty when more than one future is plausible.
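
The article does not specify the text format MolmoMotion-AR emits. As a rough illustration of what writing coordinates "in a text-like format" can mean, the sketch below serializes a trajectory as one line per time step and parses it back. The format, the function names and the two-decimal rounding are all assumptions.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Trajectory = List[List[Point3D]]  # [time step][query point] -> (x, y, z)


def encode(traj: Trajectory, decimals: int = 2) -> str:
    """Serialize a trajectory as text, one time step per line (assumed format)."""
    lines = []
    for step in traj:
        points = [",".join(f"{c:.{decimals}f}" for c in p) for p in step]
        lines.append(" ".join(points))
    return "\n".join(lines)


def decode(text: str) -> Trajectory:
    """Parse the text produced by encode back into floats."""
    traj = []
    for line in text.splitlines():
        step = []
        for token in line.split():
            x, y, z = (float(c) for c in token.split(","))
            step.append((x, y, z))
        traj.append(step)
    return traj


traj = [[(0.1, 0.0, 0.5)], [(0.15, 0.0, 0.5)]]
text = encode(traj)
print(text)
print(decode(text) == traj)
```

A step-by-step text output like this commits to a single sequence of coordinates, which is one way to read the contrast with the continuous-space FM variant that is meant to handle several plausible futures.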

To build the training data, the team created an automated pipeline for extracting object-grounded 3D trajectories from ordinary videos. It filtered out noisy tracks, smoothed the remaining motion and trimmed clips to the periods when the object was actually moving.
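
The three cleanup steps named above (filter, smooth, trim) can be sketched in a few lines of Python. This is an illustration under assumptions, not the team's pipeline: the jump-threshold filter, the moving-average smoother, the speed-based trim and every threshold value are stand-ins picked for clarity.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Track = List[Point3D]  # one point's position at each frame


def is_noisy(track: Track, max_jump: float) -> bool:
    """Flag a track if any frame-to-frame jump exceeds max_jump."""
    return any(math.dist(a, b) > max_jump for a, b in zip(track, track[1:]))


def smooth(track: Track, window: int = 3) -> Track:
    """Centered moving average; the window shrinks at the ends of the track."""
    half = window // 2
    out = []
    for i in range(len(track)):
        chunk = track[max(0, i - half): i + half + 1]
        out.append(tuple(sum(p[k] for p in chunk) / len(chunk) for k in range(3)))
    return out


def trim_to_motion(track: Track, min_step: float) -> Track:
    """Keep the span from the first to the last frame-to-frame move above min_step."""
    moving = [
        i for i, (a, b) in enumerate(zip(track, track[1:]))
        if math.dist(a, b) > min_step
    ]
    if not moving:
        return []
    return track[moving[0]: moving[-1] + 2]


def clean(tracks: List[Track], max_jump: float = 0.5, min_step: float = 0.01) -> List[Track]:
    """Filter noisy tracks, smooth the rest, then trim to the moving period."""
    kept = [t for t in tracks if not is_noisy(t, max_jump)]
    trimmed = [trim_to_motion(smooth(t), min_step) for t in kept]
    return [t for t in trimmed if t]  # drop tracks that never move
```

Calling clean on a list of tracks returns only the tracks that pass the jump filter, each smoothed and cut down to the frames where the point is actually moving.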

Performance across tasks

On PointMotionBench, MolmoMotion performed better than the other methods the researchers tested, including pixel-based video generators, parametric 3D methods and a constant-velocity baseline. The examples cited in the report include motions such as a lint roller moving across cloth, a bowl sliding and rotating on a table, a flamingo walking while dipping its beak into water and a car following a road that bends to the right.
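
For context on the simplest of those comparisons, a constant-velocity baseline just repeats a point's last observed displacement at every future step. The Python sketch below shows that idea along with an average displacement error, a metric commonly used in trajectory forecasting. The article does not say which metrics PointMotionBench reports, so the metric choice and the function names here are assumptions.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def constant_velocity(history: List[Point3D], horizon: int) -> List[Point3D]:
    """Extrapolate one point by repeating its last observed per-step displacement."""
    if len(history) < 2:
        raise ValueError("need at least two observed positions")
    last, prev = history[-1], history[-2]
    velocity = tuple(l - p for l, p in zip(last, prev))
    return [
        tuple(l + v * k for l, v in zip(last, velocity))
        for k in range(1, horizon + 1)
    ]


def average_displacement_error(pred: List[Point3D], truth: List[Point3D]) -> float:
    """Mean Euclidean distance between predicted and true positions."""
    if not pred or len(pred) != len(truth):
        raise ValueError("pred and truth must be non-empty and the same length")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


history = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]  # point moving along x
pred = constant_velocity(history, horizon=2)  # keeps going straight
truth = [(2.0, 0.0, 0.0), (3.0, 1.0, 0.0)]    # the true path bends away
print(pred)
print(average_displacement_error(pred, truth))
```

The toy data shows why such a baseline struggles on motions like the car following a bending road: straight-line extrapolation accumulates error as soon as the true path curves.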

The team also tested whether the model’s motion understanding could help robots. After fine-tuning on DROID, a large dataset of robot manipulation videos, MolmoMotion predicted object paths across different objects, camera angles and tasks. In simulation, a control policy built on MolmoMotion succeeded on 76.3% of pick-and-place tasks, compared with 56.0% for the same policy built on Molmo 2. The MolmoMotion-based policy also learned faster, reaching 51% success after 10,000 training steps, while the Molmo 2 version reached 19%.
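
To put those success rates in perspective, the short Python sketch below works out the gaps from the reported figures. The compare helper is hypothetical; the four input numbers are the ones quoted above.

```python
def compare(new: float, old: float) -> dict:
    """Summarize the gap between two success rates given in percent."""
    return {
        "point_gain": new - old,                       # percentage points
        "relative_gain_pct": (new - old) / old * 100,  # gain relative to the old rate
        "ratio": new / old,                            # how many times higher
    }


final = compare(76.3, 56.0)  # final pick-and-place success rates
early = compare(51.0, 19.0)  # success rates after 10,000 steps

print(f"final: +{final['point_gain']:.1f} points, "
      f"{final['relative_gain_pct']:.1f}% relative")
print(f"early: +{early['point_gain']:.1f} points, {early['ratio']:.1f}x")
```

The final gap works out to about 20.3 percentage points, or roughly 36% in relative terms, while the 10,000-step gap is 32 points, with the MolmoMotion-based policy succeeding about 2.7 times as often.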

The researchers said the model can also guide video generation by supplying predicted motion paths to an image-to-video system. In their tests, this improved motion quality on all five motion-related metrics they measured, and in four of those five cases it beat a larger image-to-video model.

Allen AI says the model weights, dataset and benchmark are being released publicly for research and further development. The team also noted limitations, including the use of only eight query points per object during training, which makes the system less suited to dense or highly deformable motion.