Alibaba’s Qwen team has unveiled a new robot-focused model suite aimed at narrowing one of robotics’ biggest gaps: turning perception and reasoning into physical action.

The company said the Qwen-Robot Suite is built around three foundation models, each targeting a different part of embodied intelligence. Qwen-RobotNav is designed for navigation, Qwen-RobotManip for manipulation, and Qwen-RobotWorld for predicting how environments change over time. Together, the models are meant to help agents move from simply understanding scenes to acting within them.

The announcement comes as large multimodal models have become better at describing the physical world but still struggle to control robots in it. Qwen said that vision-language systems can already interpret spatial relationships, identify objects in cluttered scenes and follow visual instructions. Those capabilities do not automatically translate into motor commands, however, because motor commands remain tied to the action space and embodiment of each individual robot.

Three models, three tasks

Qwen-RobotNav focuses on mobility. The company said it combines language and vision into a parameterized navigation interface that can handle task types including instruction following, object search, tracking and autonomous driving. Qwen said the model was trained on 15.6 million samples and can adjust how it uses visual history through settings such as token budget, time decay and camera weighting.
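To make the three history settings concrete, here is a minimal sketch of how a token budget, a time-decay factor and per-camera weights could interact. It is an illustration only, not Qwen's implementation: the `Frame` class, the `allocate_tokens` function and the scoring rule (camera weight multiplied by decay raised to the frame's age) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    age: int      # steps since capture (0 = newest); hypothetical field
    camera: str   # hypothetical camera name, e.g. "front"


def allocate_tokens(frames, token_budget, time_decay, camera_weights):
    """Split a fixed token budget across past frames.

    Assumed scoring rule: score = camera_weight * time_decay ** age,
    so newer frames and higher-weighted cameras get a larger share.
    """
    scores = [
        camera_weights.get(f.camera, 1.0) * time_decay ** f.age
        for f in frames
    ]
    total = sum(scores)
    if total == 0:
        return [0] * len(frames)
    # Round down so the allocation never exceeds the budget.
    return [int(token_budget * s / total) for s in scores]


history = [Frame(0, "front"), Frame(1, "front"), Frame(0, "rear")]
shares = allocate_tokens(
    history,
    token_budget=100,
    time_decay=0.5,
    camera_weights={"front": 1.0, "rear": 0.5},
)
print(shares)
```

In this toy setup the newest front-camera frame receives half of the budget, while the older front frame and the down-weighted rear frame split the rest.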

In its report, the company said the navigation model achieved strong results across five benchmark domains and was also used as part of a larger agent system. In that setup, a higher-level planner can break down long tasks and call the navigation model repeatedly, adjusting its mode and memory strategy as the episode unfolds.
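The planner-and-navigator arrangement described above can be pictured as a simple loop. The sketch below is hypothetical: `NavCall`, `plan`, `run_episode`, the mode names and the token-budget values are invented for illustration, and the `navigate` callable stands in for whatever interface the real model exposes.

```python
from dataclasses import dataclass


@dataclass
class NavCall:
    mode: str          # hypothetical mode name, e.g. "object_search"
    goal: str          # sub-goal text handed to the navigation model
    token_budget: int  # how much visual history to keep for this call


def plan(task):
    """Toy planner: split the task on ' then ' and pick a mode per step."""
    calls = []
    for step in task.split(" then "):
        step = step.strip()
        if step.startswith("find "):
            calls.append(NavCall("object_search", step[len("find "):], 512))
        else:
            calls.append(NavCall("instruction_following", step, 256))
    return calls


def run_episode(task, navigate):
    """Call the navigation model once per sub-task; stop at the first failure."""
    log = []
    for call in plan(task):
        ok = navigate(call)
        log.append((call.mode, call.goal, ok))
        if not ok:
            break
    return log


# A stub that always succeeds stands in for the navigation model here.
episode = run_episode(
    "go to the kitchen then find the red mug",
    navigate=lambda call: True,
)
print(episode)
```

The point of the sketch is the division of labor: the planner decides which mode and how much memory each sub-task needs, and the navigation model is treated as a reusable tool.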

Qwen-RobotManip is intended to bridge language understanding and robot-arm control. The company said the system uses a shared 80-dimensional state-action representation so that data from single-arm, dual-arm, dexterous-hand and mobile robots can be combined more effectively in training. It also expresses actions as camera-frame delta poses, meaning changes in pose measured relative to the camera rather than to each robot's own base, which Qwen said makes similar motions look more alike across different robots. According to the company, the model was trained on more than 38,100 hours of open-source and synthesized data spanning 15 embodiments.
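Two ideas in that description lend themselves to a short sketch: packing robots with different degrees of freedom into one fixed-width vector, and expressing a motion relative to the camera. The code below is an illustration under stated assumptions, not the company's method. Only the width of 80 comes from the announcement; the slot layout in `SLOTS`, the validity mask and both function names are hypothetical, and the delta covers translation only, leaving out rotation.

```python
STATE_DIM = 80  # shared width reported in the announcement

# Hypothetical layout: which indices each embodiment fills.
SLOTS = {
    "single_arm": range(0, 7),
    "dual_arm": range(0, 14),
}


def to_shared(embodiment, values):
    """Place embodiment-specific values into the shared vector.

    Returns (vector, mask); mask marks which entries are actually used,
    so unused slots can be ignored during training.
    """
    slots = SLOTS[embodiment]
    if len(values) != len(slots):
        raise ValueError("value count does not match the slot layout")
    vector = [0.0] * STATE_DIM
    mask = [0] * STATE_DIM
    for index, value in zip(slots, values):
        vector[index] = float(value)
        mask[index] = 1
    return vector, mask


def camera_frame_delta(p_prev, p_next, cam_to_world):
    """Express a world-frame position change in the camera frame.

    cam_to_world is a 3x3 rotation given as a list of rows; its columns
    are the camera axes written in world coordinates. Multiplying by the
    transpose maps a world-frame vector into the camera frame.
    """
    d = [b - a for a, b in zip(p_prev, p_next)]
    return [sum(cam_to_world[r][c] * d[r] for r in range(3)) for c in range(3)]


vector, mask = to_shared("single_arm", [0.1] * 7)

# Camera rotated 90 degrees about the world z-axis: its x-axis points along world +y.
rotation = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
delta = camera_frame_delta([0.0, 0.0, 0.0], [0.0, 1.0, 0.0], rotation)
print(sum(mask), delta)
```

Here a move along the world's +y axis becomes a move along the camera's +x axis. Two robots mounted at different angles but performing the same motion in front of the camera would therefore produce the same delta.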

Qwen said the manipulation model showed gains on a range of generalization tests, including tasks involving unseen environments, cross-embodiment transfer and recovery from perturbations.

Qwen-RobotWorld is the most forward-looking part of the suite. Rather than focusing on immediate control, it is designed as a world model that predicts how a scene will change after an action described in natural language. Qwen said that expressing actions in language allows a single system to cover navigation, manipulation and driving tasks while learning the transition dynamics of physical environments.

Building blocks for embodied agents

The company is positioning the suite not only as three separate research projects, but also as infrastructure for broader agent systems. Qwen described the models as modular tools that can be combined with planners to support long-horizon tasks and more flexible behavior in real settings.

Qwen also highlighted deployment scenarios beyond simulation. For Qwen-RobotNav, the company said it demonstrated zero-shot use on a Unitree Go2 quadruped equipped with a low-resolution built-in camera. In the reported tests, the robot followed spoken instructions in an unfamiliar apartment setting.

The release reflects a broader push in AI toward systems that can operate in the physical world rather than only generate text or images. For robotics, that means not just recognizing objects and places, but aligning language, vision and control in ways that can scale across different machines and tasks.

With the Qwen-Robot Suite, Alibaba is betting that the next step in foundation models is not just seeing the world more clearly, but learning how to act in it.