Nvidia has introduced Nemotron 3 Ultra, a large open model the company says is designed to help AI agents reason through long-running tasks more efficiently. The model is aimed at workflows where systems must plan, use tools, delegate subtasks, and keep context across many turns without losing focus or driving up costs.
The company describes Nemotron 3 Ultra as a 550-billion-parameter mixture-of-experts model with 55 billion parameters active per token. Nvidia says the model is intended for the most demanding parts of agent orchestration, including long-running coding sessions, research tasks that draw on large sets of sources, and technical verification work with many constraints.
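As a rough illustration of what those two figures imply, the share of parameters used for any one token can be computed directly. The function name below is hypothetical, and the snippet is plain arithmetic on Nvidia's stated numbers, not a description of the model's internals.

```python
def active_fraction(total_params: int, active_params: int) -> float:
    """Return the share of a mixture-of-experts model's parameters used per token."""
    if total_params <= 0 or not 0 <= active_params <= total_params:
        raise ValueError("expected total_params > 0 and 0 <= active_params <= total_params")
    return active_params / total_params


# Figures as stated by Nvidia for Nemotron 3 Ultra.
TOTAL_PARAMS = 550_000_000_000
ACTIVE_PARAMS = 55_000_000_000

fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"Active share per token: {fraction:.0%}")
```

On these numbers, roughly one tenth of the model's weights take part in processing each token.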
Nvidia positions the release as part of a broader shift in AI systems, from single-turn chatbots to agents that maintain memory and carry out multi-step jobs. In those settings, token usage can increase quickly as models repeatedly absorb tool outputs, prior reasoning, and new instructions. Nvidia says the new model is meant to help separate the high-level reasoning layer from lower-cost execution models used for routine tasks.
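To see why token usage climbs so quickly in that setting, consider a minimal sketch in which the model re-reads its whole context on every turn and each turn appends a fixed amount of new material. The function name and all numbers are illustrative assumptions, not figures from Nvidia.

```python
def cumulative_input_tokens(turns: int, initial_context: int, added_per_turn: int) -> int:
    """Total input tokens processed when the full context is re-read every turn."""
    total = 0
    context = initial_context
    for _ in range(turns):
        total += context           # the model reads everything accumulated so far
        context += added_per_turn  # tool output, reasoning, and instructions are appended
    return total


# Hypothetical agent run: 1,000-token starting prompt, 2,000 tokens added per turn.
ten_turns = cumulative_input_tokens(10, 1_000, 2_000)
twenty_turns = cumulative_input_tokens(20, 1_000, 2_000)
print(ten_turns, twenty_turns)
```

In this toy setup, doubling the number of turns from 10 to 20 quadruples the input tokens processed (100,000 versus 400,000), which is the kind of growth that makes it attractive to reserve an expensive model for high-level reasoning.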
The company says Nemotron 3 Ultra is faster than comparable open models in its category, claiming up to 5 times higher throughput. Nvidia also says the model reduced the number of tokens needed in experiments on benchmarks such as SWE-bench and Terminal Bench 2.0, which in turn lowered task completion costs by as much as 30%.
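The link between token counts and cost can be sketched with simple arithmetic. The price and token counts below are hypothetical, chosen only to show that at a fixed per-token price a given token reduction translates into the same relative saving; they are not Nvidia's benchmark data.

```python
def task_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of a task billed purely by token count."""
    return tokens / 1_000_000 * usd_per_million_tokens


def relative_saving(baseline_cost: float, new_cost: float) -> float:
    """Fraction of the baseline cost that was saved."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return 1 - new_cost / baseline_cost


PRICE = 2.0  # hypothetical USD per million tokens
baseline = task_cost(1_000_000, PRICE)  # hypothetical baseline token usage
reduced = task_cost(700_000, PRICE)     # same task with fewer tokens
saving = relative_saving(baseline, reduced)
print(f"Saving: {saving:.0%}")
```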
Nvidia highlighted benchmark results across a range of agent and reasoning tests, including long-horizon planning, coding, instruction following, knowledge work, professional tasks, and long-context performance. In its material, the company says the model achieved strong results on long-context workloads and maintained competitive accuracy while improving output speed.
To support those gains, Nvidia points to several architectural choices. The model uses a hybrid design that combines Mamba and transformer layers, which Nvidia says can improve efficiency on long contexts while preserving recall of specific information. It also uses LatentMoE for more efficient expert routing and multi-token prediction to speed generation during longer outputs.
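For readers unfamiliar with expert routing, the sketch below shows generic top-k routing as used in many mixture-of-experts models: a router scores every expert, only the k highest-scoring experts run, and their outputs are mixed with normalized weights. This is a textbook illustration under assumed names and values; it is not Nvidia's LatentMoE, whose details the company's announcement does not spell out here.

```python
import math


def top_k_route(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and return (expert_index, weight) pairs.

    Weights are a softmax over the selected experts only, so they sum to 1.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Rank experts by router score, highest first (ties keep their original order).
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(chosen, exps)]


# Hypothetical router scores for five experts; only two are activated.
routes = top_k_route([0.1, 2.0, -1.0, 2.0, 0.5], k=2)
print(routes)
```

Because only the selected experts run for a given token, compute per token scales with k rather than with the total number of experts.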
Another major piece of the release is NVFP4 precision, which Nvidia says allows the same checkpoint to run across Hopper, Blackwell, and Ampere GPUs. The company says that on Blackwell hardware, NVFP4 can deliver up to 5 times higher throughput per GPU than BF16 at the same level of interactivity.
Nvidia is also emphasizing the training process behind the model. Nemotron 3 Ultra uses a method the company calls Multi-Teacher On-Policy Distillation, or MOPD. Under that approach, the model learns from more than 10 specialized teacher models while generating its own outputs during training. Nvidia says the setup helps improve domain-specific reasoning more efficiently and allows capabilities to be refined over time.
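One common way to formalize on-policy distillation is to sample tokens from the student itself and penalize it where its probabilities diverge from a teacher's, which amounts to estimating a reverse KL divergence. The toy below follows that generic recipe, with a teacher chosen per domain; the teacher distributions, domain names, and function names are all hypothetical, and nothing here should be read as Nvidia's actual MOPD procedure.

```python
import math
import random

# Hypothetical next-token distributions for two specialized teachers.
TEACHERS = {
    "code": [0.7, 0.2, 0.1],
    "legal": [0.1, 0.3, 0.6],
}


def reverse_kl(student: list[float], teacher: list[float]) -> float:
    """Exact KL(student || teacher) for one next-token distribution."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)


def sampled_reverse_kl(student: list[float], teacher: list[float],
                       n_samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity using tokens the student samples."""
    rng = random.Random(seed)
    tokens = rng.choices(range(len(student)), weights=student, k=n_samples)
    return sum(math.log(student[t] / teacher[t]) for t in tokens) / n_samples


def distillation_loss(domain: str, student: list[float]) -> float:
    """Score the student against the teacher assigned to this domain."""
    return reverse_kl(student, TEACHERS[domain])


student = [0.5, 0.3, 0.2]
print(distillation_loss("code", student), distillation_loss("legal", student))
```

The loss is zero when the student already matches the relevant teacher and grows as the two diverge, so in this toy the student is pulled toward a different target depending on the domain of the prompt.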
The company says the launch also includes new training data and reinforcement learning resources. According to Nvidia, Nemotron 3 Ultra builds on a 10-trillion-token pre-training base and adds 212 billion new tokens focused on legal, general knowledge, and code data. Nvidia says it is also releasing 10 million new supervised fine-tuning samples, 1 million new reinforcement learning tasks, and 15 new RL environments.
Nvidia says the model can be fine-tuned with low-rank adaptation, supervised fine-tuning, or reinforcement learning using its NeMo tools. It also points developers to deployment recipes designed for agentic systems.
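Low-rank adaptation generally works by freezing a weight matrix and training two much smaller matrices whose product is added to it. The sketch below counts trainable parameters for one such layer; the layer dimensions and rank are hypothetical, and the code does not use or represent the NeMo API.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a rank-r update B @ A, with B: d_out x r and A: r x d_in."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return rank * (d_out + d_in)


# Hypothetical layer size and adapter rank, chosen for illustration only.
D_OUT, D_IN, RANK = 4096, 4096, 16
full = full_param_count(D_OUT, D_IN)
lora = lora_param_count(D_OUT, D_IN, RANK)
print(full, lora, f"{lora / full:.2%}")
```

For this example layer, the adapter trains 131,072 parameters instead of 16,777,216, under 1% of the full matrix.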
The release reflects Nvidia’s continuing push into open models for enterprise AI, with a focus on agent infrastructure rather than chatbot-style interaction. By combining large-scale reasoning with claims of higher throughput and lower task costs, Nvidia is aiming Nemotron 3 Ultra at developers building systems that need to work across many steps, tools, and domains.