{"slug": "nvidia-nemotron-3-ultra-powers-faster-more-efficient-reasoning-for-long-running", "title": "NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long-Running Agents", "summary": "NVIDIA released the Nemotron 3 Ultra, a 550-billion-parameter open model designed to accelerate reasoning and reduce costs for long-running AI agents. The model achieves up to five times higher throughput than comparable systems while lowering token consumption and operational expenses by as much as 30%. The release addresses the growing computational demands of multi-agent workflows, where continuous tool calls and context accumulation drive up costs and risk goal drift.", "body_md": "Single-turn chatbots are evolving into long-running [agents](https://www.nvidia.com/en-us/glossary/ai-agents/) that can reason, maintain context, use tools, and run efficiently across many turns to complete complex workflows.\n\nHowever, these [multi-agent](https://www.nvidia.com/en-us/glossary/multi-agent-systems/) workflows cause token counts to grow quickly. Agents plan, call tools, invoke sub-agents, receive information, and then pass history, outputs, and [reasoning](https://www.nvidia.com/en-us/glossary/ai-reasoning/) steps back into the model continuously. 
As tasks run longer, this constant communication increases costs and the risk of goal drift.\n\nDevelopers can solve this using a system of models: [frontier reasoning models](https://www.nvidia.com/en-us/glossary/frontier-models/) for orchestration and complex planning, and efficient models for high-volume execution, validation, and tool calling.\n\nNVIDIA is releasing NVIDIA Nemotron 3 Ultra, an open model built to help long-running agents complete tasks faster while lowering cost.\n\n## Nemotron 3 Ultra for agent orchestration\n\nNemotron 3 Ultra is a 550B-parameter Mixture-of-Experts model with 55B active parameters, built for frontier reasoning and orchestration in agentic systems.\n\nWithin any agent workflow, most calls are routine, but a critical subset demands deeper reasoning. Nemotron 3 Ultra is built to handle these hard calls: sustaining architectural decisions across coding sessions, synthesizing contradictory evidence across hundreds of research sources, or verifying chip designs across thousands of constraints.\n\n| Benchmark | Nemotron 3 Ultra (550B) | GLM 5.1 (744B) | Kimi K2.6 (1T) | Qwen3.5 (397B) |\n| --- | --- | --- | --- | --- |\n| Agent Productivity: PinchBench | 91% | 84% | 91% | 89% |\n| Long-horizon Planning: EnterpriseOps-Gym | 33% | 40% | 29% | 30% |\n| Coding: Terminal-Bench 2.0 | 54% | 64% | 67% | 53% |\n| Instruction Following: IFBench | 82% | 77% | 74% | 78% |\n| Knowledge Work: GDPVal-AA | 1,448 | 1,594 | 1,508 | 1,192 |\n| Professional Work Tasks: ProfBench (Search) | 56% | 46% | 56% | 53% |\n| Long Context: Ruler @1M | 95% | N/A (max 256K) | N/A (max 256K) | 90% |\n\n*Table 1. Nemotron 3 Ultra delivers frontier accuracy in a smaller model*\n\nNemotron 3 Ultra is also fast. It achieves 5x higher throughput compared to other open models in its class, enabling long-running agents to complete tasks faster and more efficiently.\n\nNemotron 3 Ultra is also built for efficiency. 
In experiments on SWE-bench and Terminal-Bench 2.0, it completed the benchmarks using fewer total tokens and fewer tokens per turn than comparable models. This lowers the cost for agentic tasks by up to 30%.\n\n## Breakthroughs powering Nemotron 3 Ultra\n\nTo mitigate the typical efficiency-accuracy tradeoffs for high-capacity reasoning models, the Nemotron models introduce architectural innovations:\n\n**Post-trained for agent harnesses**\n\nNemotron 3 Ultra is post-trained to deliver consistent accuracy across top harnesses. The model is trained using the NVIDIA [NeMo RL](https://github.com/nvidia-nemo/rl) and [Gym](https://github.com/NVIDIA-NeMo/gym) open libraries with one of the largest suites of long-running, task-solving, tool-using datasets in the world.\n\nUltra is optimized for agent-led open harnesses, not just single-turn chat, and is designed to work within workflows where agents plan, call tools, read observations, delegate to sub-agents, validate outputs, and recover from errors across many turns.\n\n**Hybrid Mamba-Transformer**\n\nMamba layers improve sequence efficiency for long-context workloads, while Transformer layers preserve precise recall when agents need to retrieve specific facts from large context windows.\n\n**NVFP4 precision**\n\nThe same NVFP4 checkpoint runs on NVIDIA Hopper, NVIDIA Blackwell, and NVIDIA Ampere GPUs. Developers can use one checkpoint across all NVIDIA GPU architectures thanks to specialized NVFP4 quantization kernels. 
NVFP4 also delivers up to 5x higher throughput per GPU at the same interactivity compared to BF16 on Blackwell.\n\n**LatentMoE**\n\nLatentMoE supports more efficient expert routing, enabling the model to handle workflows spanning reasoning, code generation, tool calls, and domain-specific logic.\n\n**Multi-token prediction**\n\nMulti-token prediction (MTP) helps reduce generation time by predicting multiple future tokens in a single forward pass, improving throughput for long outputs and multi-turn workflows.\n\n## Nemotron 3 Ultra adds Multi-Teacher On-Policy Distillation\n\nMulti-Teacher On-Policy Distillation (MOPD) is a training method in which Ultra learns from multiple specialized teacher models while generating its own attempts during training. More than 10 specialized teacher models are trained, each with its own domain-specific training pipeline. Each teacher scores the model in its area of expertise, helping Ultra improve reasoning across domains more efficiently.\n\nDuring MOPD, the student model generates rollouts across domains and receives dense reward signals from the corresponding teacher models. To maximize efficiency, MOPD runs asynchronously, with student rollout generation, teacher scoring, and student optimization fully pipelined.\n\nMOPD is also iterative. After producing an MOPD-trained checkpoint, new rounds of teacher training are initialized from the updated student model, and the improvements are merged into the next MOPD stage.\n\nThis co-evolution between students and teachers enables continuous capability improvement and progressively stronger specialization across domains. Users can try [MOPD recipes](https://github.com/NVIDIA-NeMo/RL/blob/ultra-v3/docs/guides/nemotron-3-ultra.md) through NeMo-RL, the library that trained the Ultra model.\n\n## Training data for stronger agent reasoning\n\nAs with all Nemotron open model launches, much of the training data pipeline is released as permissively as possible. 
For partners in enterprise and sovereign AI development, training data transparency and provenance matter as much as capability.\n\n**Domain-specific pre-training data**\n\nBuilding on a 10T token pre-training foundation, Nemotron 3 Ultra adds 212B new tokens targeting three high-value domain gaps:\n\n- **4B tokens** of synthetic legal data, increasing the proxy LegalBench average from 64.6% to 74.7%\n- **35B tokens** of synthesized Wiki-based data, boosting proxy SimpleQA from 40.2% to 50.2%\n- **173B refreshed GitHub tokens** through Sept. 30, 2025\n\n**Post-training data and RL environments**\n\nThis launch also releases 10M new SFT samples, 1M new RL tasks across multiple domains, and 15 net-new RL environments, bringing the cumulative Nemotron open data totals to 50M SFT samples, 2M RL tasks, and 55 RL environments.\n\nThe result is SWE-bench Verified scores between 65% and 70.4% across Pi, OpenHands, Hermes, OpenCode, and Mini SWE Agent, giving consistent performance regardless of which framework you deploy.\n\n**Fine-tune for your domain**\n\nNemotron 3 Ultra can be fine-tuned using LoRA, SFT, and reinforcement learning with the NVIDIA [NeMo libraries](https://github.com/NVIDIA-NeMo). 
Developers can get started with the following recipes.\n\nNemotron 3 Ultra Recipes:\n\n- SFT LoRA: NeMo Automodel ([H100 Recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/nemotron/nemotron_ultra_v3_hellaswag_peft.yaml), [GB200 Recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/nemotron/nemotron_ultra_v3_hellaswag_peft_gb200.yaml))\n- Full SFT: NeMo Megatron Bridge ([Recipes](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/nemotron_3_ultra/examples/models/nemotron/nemotron_3/ultra))\n- Reinforcement Learning: NeMo RL ([GRPO recipe](https://github.com/NVIDIA-NeMo/RL/blob/ultra-v3/docs/guides/nemotron-3-ultra.md), [GRPO LoRA recipe](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/usage-cookbook/Nemotron-3-Ultra), [MOPD recipe](https://github.com/NVIDIA-NeMo/RL/blob/ultra-v3/docs/guides/nemotron-3-ultra.md))\n- Deployment: Dynamo ([Recipe](https://github.com/ai-dynamo/dynamo/tree/main/recipes/nemotron-3-ultra))\n\n## See it in action\n\nThis walkthrough shows how to spin up and run an autoresearch flow using Hermes Agent powered by Nemotron 3 Ultra on [build.nvidia.com](https://build.nvidia.com/nvidia/nemotron-3-ultra-550b-a55b).\n\n## Run agents more safely with NVIDIA NemoClaw and NVIDIA OpenShell\n\nNemotron models integrate with leading open agent frameworks. To build a secure, always-on agentic system, it is important to understand the reference stack:\n\n- **Hermes Agent and OpenClaw:** These are popular agent harnesses that provide the orchestration loops, memory, and tools for multi-turn workflows. Hermes Agent is now officially available and fully supported for use with Nemotron.\n- **NVIDIA OpenShell:** Available now in early preview, OpenShell is the secure runtime environment (part of the NVIDIA Agent Toolkit) where autonomous agents and their generated code execute.\n- **NVIDIA NemoClaw:** This is the open-source blueprint that ties the environment together. 
With a single command, NemoClaw installs the OpenShell runtime—providing a secure environment for running autonomous agents like Hermes Agent more safely alongside open-source models like Nemotron.\n\n## Build safer and voice-enabled agents\n\nTwo new Nemotron models are also launching:\n\n**Nemotron 3.5 Content Safety**\n\nFor teams building safer enterprise AI, [Nemotron 3.5 Content Safety](https://huggingface.co/nvidia/Nemotron-3.5-Content-Safety) is an open, efficient 4B guardrail model for classifying unsafe, disallowed, or policy-violating content across text, images, and combined inputs.\n\nCovering 23 safety categories and 12 languages, it can be used as an inference-time guardrail, as a judge for LLM safety testing and evaluation, or with the accompanying training dataset to post-train models for safer behavior. Custom policy support and reasoning trails help enterprises adapt safety decisions to domain-specific rules, audit classifications, and deploy safety controls across global AI workflows. Read the [Hugging Face post](https://huggingface.co/blog/nvidia/nemotron-3.5-content-safety) to learn more.\n\n**Nemotron 3.5 ASR**\n\nFor voice-native agents, [Nemotron 3.5 ASR](https://huggingface.co/nvidia/NVIDIA-Nemotron-3.5-ASR-Streaming-Multilingual-0.6b) uses the same cache-aware streaming architecture as its English predecessor, Nemotron 3 ASR, to process audio deltas instantly. Eliminating redundant buffered compute ensures sub-100 ms latency for natural, real-time voice orchestration for your agentic swarms.\n\nThe English model has seen strong developer adoption, including powering the voice input feature in [Microsoft GitHub Copilot CLI](https://aka.ms/FL_BUILD_2026), used by more than 20M developers. An independent benchmark of [50+ on-device ASR](https://arxiv.org/html/2604.14493v2) configurations identified Nemotron 3 ASR as the strongest candidate for real-time English streaming on resource-constrained hardware. 
Now, that same architecture goes multilingual with support for 40+ languages in a single checkpoint.\n\n## Updated open licensing for broader adoption\n\nNemotron model releases are moving to OpenMDW-1.1, the Linux Foundation’s permissive license purpose-built for open AI model distributions. OpenMDW is designed to cover the full set of model materials, including architecture, parameters, documentation, software, and other related artifacts, under a single framework.\n\nThis gives developers and enterprises clearer terms for using, modifying, redistributing, and deploying Nemotron models, while reducing the licensing ambiguity that can slow evaluation and adoption of open models.\n\n## Start building today\n\nNemotron 3 Ultra is fully open, including weights, data, and recipes, so developers can adapt the models to domain-specific workflows and deploy them anywhere. It is available across leading inference platforms and, packaged as an NVIDIA NIM microservice, can run anywhere. Try it on [Perplexity](https://www.perplexity.ai/) with a Pro subscription or through API, [OpenRouter](https://openrouter.ai/nvidia/nemotron-3-ultra-550b-a55b), [Anaconda](https://www.anaconda.com/blog/nvidia-nemotron-3-ultra-available-anaconda), or [build.nvidia.com](https://build.nvidia.com/nvidia/nemotron-3-ultra-550b-a55b).\n\nDownload the weights from [Hugging Face](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4), launch an optimized instance through NVIDIA NIM, or start with the cookbooks to get running in minutes.\n\nNemotron 3 Ultra is also available through AWS JumpStart, Amazon EKS, Baseten, Bitdeer AI, CoreWeave, Crusoe, DeepInfra, Dell Enterprise Hub, DigitalOcean, Eigen AI, fal (ASR), Fireworks AI, FriendliAI, GMI Cloud, Google Cloud, Lightning 
AI, Microsoft Foundry, Modal, Nebius Token Factory, Prime Intellect, Simplismart, Together AI (along with ASR), and Vultr.\n\nCheck out the [GitHub repository](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/usage-cookbook/Nemotron-3-Ultra/) for getting-started instructions for agent harnesses, including [BlackBox AI](https://www.blackbox.ai/blog/nemotron-on-blackbox), [Cline](https://docs.cline.bot/api/models), [CrewAI](https://docs.crewai.com/en/concepts/llms#nvidia-nemotron), [Factory AI](https://docs.factory.ai/cli/user-guides/choosing-your-model), [Hermes Agent](https://unsloth.ai/docs/models/nemotron-3-ultra), [Kilo Code](https://kilo.ai/docs/gateway/models-and-providers), [LangChain Deep Agents](https://docs.langchain.com/oss/python/deepagents/models), OpenClaw, [OpenCode](https://opencode.ai/docs), [OpenHands](https://docs.openhands.dev/openhands/usage/llms/openrouter), and [Pi](https://pi.dev/models?provider=openrouter&name=nemotron+3+ultra).\n\nFor the full technical details, read the [Nemotron 3 Ultra technical report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf).\n\n*Stay up to date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.*\n\n*Visit the Nemotron developer page for resources to get started. 
Explore open Nemotron models and datasets on Hugging Face and Blueprints on build.nvidia.com.*\n\n*Engage with Nemotron livestreams, tutorials, and the developer community on the NVIDIA forum and Discord.*", "url": "https://wpnews.pro/news/nvidia-nemotron-3-ultra-powers-faster-more-efficient-reasoning-for-long-running", "canonical_source": "https://developer.nvidia.com/blog/nvidia-nemotron-3-ultra-powers-faster-more-efficient-reasoning-for-long-running-agents/", "published_at": "2026-06-04 13:02:49+00:00", "updated_at": "2026-06-04 13:06:46.754118+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "ai-infrastructure", "ai-research", "ai-products"], "entities": ["NVIDIA", "Nemotron 3 Ultra"], "alternates": {"html": "https://wpnews.pro/news/nvidia-nemotron-3-ultra-powers-faster-more-efficient-reasoning-for-long-running", "markdown": "https://wpnews.pro/news/nvidia-nemotron-3-ultra-powers-faster-more-efficient-reasoning-for-long-running.md", "text": "https://wpnews.pro/news/nvidia-nemotron-3-ultra-powers-faster-more-efficient-reasoning-for-long-running.txt", "jsonld": "https://wpnews.pro/news/nvidia-nemotron-3-ultra-powers-faster-more-efficient-reasoning-for-long-running.jsonld"}}