Nemotron 3 Ultra went live June 4. Here's the call that works.

NVIDIA released Nemotron 3 Ultra on June 4, 2026: a 550-billion-parameter open-weights model with the highest intelligence score among US open models. It uses a hybrid Mamba-Transformer MoE architecture with 90% sparsity and a 1M-token context window, and it scored 48 on the Artificial Analysis Intelligence Index, trailing Chinese and closed models but offering faster inference.

Ultra is NVIDIA's largest open-weights model and the new high-water mark for US open releases. Before you wire it into an agent harness, here is exactly what landed and where it sits on the leaderboard.

Nemotron 3 Ultra is a 550-billion-parameter hybrid Mamba-Transformer mixture-of-experts (MoE) model with up to roughly 55B active parameters per token, or about 90% sparsity. It tops the three-model Nemotron 3 family (Nano, Super, Ultra), ships a 1M-token context window, and uses NVFP4 (4-bit floating point) training on NVIDIA's Blackwell architecture plus a "LatentMoE" hardware-aware expert router. It is post-trained for agent harnesses including Hermes Agent, LangChain Deep Agents, OpenHands, and OpenCode.

On the Artificial Analysis Intelligence Index, Ultra scored 48. That makes it the most capable open model from a US lab to date, but it trails China's Kimi K2.6 and closed models such as Anthropic's Opus 4.8:

| Model | Type | Intelligence Index |
|---|---|---|
| Opus 4.8 (Anthropic) | Closed | 61 |
| Kimi K2.6 | Open (China) | 54 |
| Nemotron 3 Ultra | Open (US) | 48 |

Speed is the more interesting story. Evaluating BF16 weights in partnership with NVIDIA, Artificial Analysis measured over 300 tokens/second on a pre-release DeepInfra endpoint, versus roughly 50–100 tok/s for similarly sized Chinese open models from labs like DeepSeek and Moonshot.
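To make the sparsity and throughput figures concrete, here is a minimal back-of-the-envelope sketch. It uses only the numbers quoted above (550B total, roughly 55B active, over 300 tok/s versus 50–100 tok/s). The 2,000-token response length and the helper names `sparsity` and `generation_seconds` are illustrative assumptions, and the timing covers decode only, not prompt processing or network latency.

```python
# Figures quoted in the article; everything else is illustrative.
TOTAL_PARAMS_B = 550   # total parameters, in billions
ACTIVE_PARAMS_B = 55   # approximate upper bound on active parameters per token


def sparsity(total_b: float, active_b: float) -> float:
    """Fraction of parameters left inactive for a given token."""
    return 1.0 - active_b / total_b


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Decode time only; ignores prompt processing and network latency."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    print(f"sparsity: {sparsity(TOTAL_PARAMS_B, ACTIVE_PARAMS_B):.0%}")

    response_tokens = 2_000  # hypothetical agent response length
    scenarios = [
        ("Ultra, at the 300 tok/s lower bound", 300),
        ("peer model, fast end (100 tok/s)", 100),
        ("peer model, slow end (50 tok/s)", 50),
    ]
    for label, tps in scenarios:
        seconds = generation_seconds(response_tokens, tps)
        print(f"{label}: {seconds:.1f} s")
```

Under these assumptions, the same response takes several times longer to decode on the slower endpoints, and that gap compounds across the many sequential calls an agent loop makes.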
"Nemotron 3 Ultra lands in what we call the most attractive intelligence-vs-speed quadrant," according to Artificial Analysis.

General availability runs through build.nvidia.com (as NIM microservices), Hugging Face, OpenRouter, ModelScope, and cloud partners. The rest of this guide covers the call that actually works.

Three things gate your first Ultra call: an account, the right hardware, and the right checkpoint.

For build.nvidia.com (https://research.nvidia.com/labs/nemotron/Nemotron-3/) you need an NVIDIA NGC account and an API key; the free tier covers low-volume prototyping. OpenRouter is the alternative path and uses its own account key instead. Pick one, not both.

Mind the compute floor. NVIDIA benchmarked Ultra Base on GB200 NVL72, and the smaller Nemotron 3 Super (120B total / 12B active) already lists an 8×H100-80GB minimum. The 550B Ultra is larger, so plan for data-center hardware or a hosted endpoint, not a workstation.

Finally, use the post-trained instruct checkpoint, not Ultra Base. NVIDIA's own Base usage guide states that the base weights have not undergone instruction tuning or alignment and are not a drop-in assistant. Ultra's final public model slug was not in the Build/NIM API lists before the June 4 launch, so pull the exact identifier from the live model card before writing any code.

The fastest way to call Nemotron 3 Ultra is the OpenAI-compatible Chat Completions API: the same client works across all three delivery paths, and only the base_url and model slug change. NVIDIA shipped Ultra on June 4, 2026 via build.nvidia.com NIM microservices, OpenRouter, and Hugging Face. Pick a path based on whether you want managed inference, a no-NGC fallback, or a self-hosted container.

Path 1: build.nvidia.com (hosted NIM). Generate an NGC API key, then instantiate the standard OpenAI Python client with base_url="https://integrate.api.nvidia.com/v1" and api_key=