Nemotron 3 Ultra went live June 4. Here's the call that works.

NVIDIA released Nemotron 3 Ultra on June 4, 2026, a 550-billion-parameter open-weights model that achieves the highest intelligence score among US open models. The model uses a hybrid Mamba-Transformer MoE architecture with 90% sparsity and a 1M-token context window, and it scored 48 on the Artificial Analysis Intelligence Index, trailing Chinese and closed models but offering superior inference speed.

NVIDIA shipped Nemotron 3 Ultra on June 4, 2026: its largest open-weights model and the new high-water mark for US open releases. Before you wire it into an agent harness, here is exactly what landed and where it sits on the leaderboard.

Nemotron 3 Ultra is a 550-billion-parameter hybrid Mamba-Transformer mixture-of-experts (MoE) model with up to roughly 55B active parameters per token (about 90% sparsity). It tops the three-model Nemotron 3 family (Nano, Super, Ultra), ships a 1M-token context window, and uses NVFP4 (4-bit floating point) training on NVIDIA's Blackwell architecture plus a "LatentMoE" hardware-aware expert router. It is post-trained for agent harnesses including Hermes Agent, LangChain Deep Agents, OpenHands, and OpenCode.

On the Artificial Analysis Intelligence Index, Ultra scored 48, which makes it the most capable open model from a US lab to date, but it trails the Chinese open-weights Kimi K2.6 and closed models such as Anthropic's Opus 4.8:

| Model | Type | Intelligence Index |
|---|---|---|
| Opus 4.8 (Anthropic) | Closed | 61 |
| Kimi K2.6 | Open (China) | 54 |
| Nemotron 3 Ultra | Open (US) | 48 |

Speed is the more interesting story. Evaluating BF16 weights in partnership with NVIDIA, Artificial Analysis measured over 300 tokens/second on a pre-release DeepInfra endpoint, versus roughly 50–100 tok/s for similarly sized Chinese open models from DeepSeek and Moonshot. "Nemotron 3 Ultra lands in what we call the most attractive intelligence-vs-speed quadrant," according to Artificial Analysis.

General availability runs through build.nvidia.com (as NIM microservices), Hugging Face, OpenRouter, ModelScope, and cloud partners. The rest of this guide covers the call that actually works.

Three things gate your first Ultra call: an account, the right hardware, and the right checkpoint.
For [build.nvidia.com](https://research.nvidia.com/labs/nemotron/Nemotron-3/) you need an NVIDIA NGC account and an API key; the free tier covers low-volume prototyping. OpenRouter is the alternative path and uses its own account key instead. Pick one, not both.

Mind the compute floor. NVIDIA benchmarked Ultra Base on GB200 NVL72, and the smaller Nemotron 3 Super (120B total / 12B active) already lists an 8×H100-80GB minimum. The 550B Ultra is larger, so plan for data-center hardware or a hosted endpoint, not a workstation.

Finally, use the post-trained instruct checkpoint, not Ultra Base. NVIDIA's own Base usage guide states the base weights have not undergone instruction tuning or alignment and are not a drop-in assistant. Ultra's final public model slug was not in Build/NIM API lists before the June 4 launch, so pull the exact identifier from the live model card before writing any code.

The fastest way to call Nemotron 3 Ultra is the OpenAI-compatible Chat Completions API: the same client works across all three delivery paths, and only the `base_url` and model slug change. NVIDIA shipped Ultra on June 4, 2026 via build.nvidia.com NIM microservices, OpenRouter, and Hugging Face. Pick a path based on whether you want managed inference, a no-NGC fallback, or a self-hosted container.

Path 1: build.nvidia.com (hosted NIM). Generate an NGC API key, then instantiate the standard OpenAI Python client with `base_url="https://integrate.api.nvidia.com/v1"` and `api_key=<NGC key>`. Set `model=` to the exact slug printed on the live Ultra model card, enable streaming, and read tokens from the response. The confirmed pattern from the Nemotron 3 Super Build page uses the same client with a slug such as `nvidia/nemotron-3-super-120b-a12b` and streamed `reasoning_content` chunks.

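
Because only the base URL and the credential differ between delivery paths, one way to keep that choice in a single place is a small lookup table. The sketch below is illustrative rather than anything NVIDIA publishes: `Endpoint`, `ENDPOINTS`, `build_request`, and the `OPENROUTER_API_KEY` variable name are hypothetical, the URLs are the ones quoted in this guide, and it assumes a self-hosted container needs no bearer token.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Endpoint:
    base_url: str
    key_env: Optional[str]  # env var holding the API key; None means no key is sent


# URLs come from this guide. The env var names and the key-less local entry
# are assumptions made for illustration.
ENDPOINTS = {
    "nvidia": Endpoint("https://integrate.api.nvidia.com/v1", "NVIDIA_API_KEY"),
    "openrouter": Endpoint("https://openrouter.ai/api/v1", "OPENROUTER_API_KEY"),
    "local": Endpoint("http://0.0.0.0:8000/v1", None),
}


def build_request(path: str, model: str,
                  env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the URL, headers, and model slug for one delivery path."""
    env = os.environ if env is None else env
    try:
        endpoint = ENDPOINTS[path]
    except KeyError:
        raise ValueError(
            f"unknown path {path!r}; expected one of {sorted(ENDPOINTS)}"
        ) from None

    headers = {"Content-Type": "application/json"}
    if endpoint.key_env is not None:
        key = env.get(endpoint.key_env)
        if not key:
            raise RuntimeError(f"Set {endpoint.key_env}")
        headers["Authorization"] = f"Bearer {key}"

    return {
        "url": f"{endpoint.base_url}/chat/completions",
        "headers": headers,
        "model": model,  # always copy the exact slug from the live model card
    }
```

Passing `env` explicitly keeps the function easy to test; in normal use it falls back to `os.environ`, so switching providers means changing one string rather than editing client code.
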
The illustrative snippet below (not executed; it needs a live key and the final slug) shows the minimal HTTP call:

```python
import json
import os
import urllib.request

# Read the key from the environment so it never lands in source control.
api_key = os.environ.get("NVIDIA_API_KEY")
if not api_key:
    raise SystemExit("Set NVIDIA_API_KEY")

payload = {
    # Placeholder slug: copy the exact string from the live Ultra model card.
    "model": "nvidia/nemotron-3-ultra",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
    "stream": False,
}

req = urllib.request.Request(
    "https://integrate.api.nvidia.com/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "Accept": "application/json",
    },
)

with urllib.request.urlopen(req, timeout=30) as r:
    data = json.load(r)

print(data["choices"][0]["message"]["content"])
```

Path 2: OpenRouter. Identical client code, but point `base_url` at `https://openrouter.ai/api/v1` with an OpenRouter key; no NGC credential is required. This is a useful fallback while the NIM slug propagates across regions.

Path 3: self-hosted NIM container. Run `docker login nvcr.io` with NGC credentials, then `docker run --gpus all -p 8000:8000 <NIM image>` and POST a standard messages payload to `http://0.0.0.0:8000/v1/chat/completions`.

For inference defaults, borrow the published Nemotron 3 Super model-card values until the Ultra card states otherwise: `temperature=1.0` and `top_p=0.95` across reasoning, tool-calling, and general chat. Toggle extended reasoning with `enable_thinking=True/False` in the chat-template kwargs; reasoning tokens then arrive in the `reasoning_content` field of each streamed chunk.

Three traps will burn time if you skip them. The first is the checkpoint itself: NVIDIA's Ultra Base usage guide states the 550B-total / up-to-55B-active hybrid Mamba-Transformer MoE checkpoint has not undergone instruction tuning or post-training alignment, and is a starting point for domain fine-tuning and RL, not a drop-in assistant. Call Base directly as a chatbot and you get incoherent output.
Wait for the post-trained card before wiring it into a pipeline.

The second is compute. Every throughput figure NVIDIA publishes references GB200 NVL72, and the smaller Super (120B total / 12B active) already lists an 8×H100-80GB minimum, so full Ultra is a multi-GPU data-center workload. DGX Spark (GB10 SoC, 128 GB unified memory) targets the Nano and quantized Super tiers, not Ultra. NVFP4, Ultra's intended cost-reduction path, needs Blackwell-class silicon; Ampere or Hopper clusters cannot claim the FP4 savings, so your real per-token cost runs above NVIDIA's headline figures.

The third is slug lag. Hugging Face's NVIDIA profile still described Ultra as in development on at least one checked page before launch. Never copy a model slug from a third-party blog; read the live model card and paste the exact string.

Once you have a working call, the deeper value is post-training and orchestration. The [NVIDIA-NeMo/Nemotron](https://github.com/NVIDIA-NeMo/Nemotron) repo exposes the full Pretrain → SFT → RL pipeline; teams building domain-specific variants should start with the SFT recipe under `training/` and the `usage-cookbook/` for tool-calling and RAG patterns. NVIDIA's recommended topology keeps cost down: use Ultra as the planner/reasoner for hard coding or research steps, and route perception, routing, and summarization to cheaper Nano or Super sub-agents (source: [DataCamp, 2026](https://www.datacamp.com/blog/nvidia-nemotron-3)).

Two vendor figures still need replication on your own corpora: 60% fewer reasoning tokens versus Nemotron 2 Nano and a 91% PinchBench agent-productivity score. Treat both as hypotheses until the June 4 weights and endpoints let you measure them directly. The takeaway: ship the hosted call today, but earn the cost and accuracy claims with your own evals before you wire Ultra into production.

Yes, you can call Ultra today.
NVIDIA states Ultra reaches general availability on June 4, 2026, hosted via build.nvidia.com (as NIM microservices), OpenRouter, Hugging Face, and select cloud partners. For the lowest-friction path, generate an NGC API key, then call the OpenAI-compatible Chat Completions endpoint at `https://integrate.api.nvidia.com/v1` using the exact model slug shown on the published Ultra page.

Ultra Base is an unaligned pretrained checkpoint: a 550B-total, up-to-55B-active hybrid Mamba-Transformer MoE intended as a starting point for SFT and RL post-training, not a drop-in assistant. NVIDIA's own usage guide states explicitly that the base checkpoint has not undergone instruction tuning or post-training alignment and is not meant for out-of-the-box production use. For chat, reasoning, and tool-calling, call the post-trained instruct variant once its model card is live.

No, full Ultra will not run on a workstation. NVIDIA measured Ultra's throughput on the GB200 NVL72 platform, and even the smaller Super (120B total / 12B active) lists an 8×H100-80GB minimum, so Ultra realistically requires multi-GPU or data-center hardware. The DGX Spark (GB10 SoC, 128 GB unified memory) targets the Nano and quantized Super tiers, not full Ultra. Without that cluster, use a hosted endpoint.

Artificial Analysis scores Ultra at 48 on its Intelligence Index, which makes it the most capable US open-weights model as of June 2026, ahead of Gemma 4 31B (39) and Nemotron 3 Super (36). It still trails the Chinese open-weights Kimi K2.6 (54) and closed models such as Anthropic's Opus 4.8 (61). Ultra leads the US open field but is not at the closed-model frontier.

Until an Ultra-specific model card confirms otherwise, the best documented defaults come from the Nemotron 3 Super card: `temperature=1.0` and `top_p=0.95` across reasoning, tool-calling, and general chat. Toggle extended reasoning via `enable_thinking=True/False` in chat-template kwargs; reasoning tokens stream back in the `reasoning_content` field.
Validate these against your own workload once the June 4 weights are live.
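
The defaults above can be collected into a small payload builder so they are easy to change once the Ultra card is published. This is a sketch under stated assumptions, not NVIDIA's reference code: `build_payload`, `split_delta`, and `SUPER_CARD_DEFAULTS` are hypothetical names, the `chat_template_kwargs` request field follows the convention used by vLLM-style OpenAI-compatible servers and should be checked against the live model card, and the chunk layout (`choices[0].delta`) assumes the usual OpenAI streaming shape.

```python
import json

# Borrowed from the Nemotron 3 Super model card, per this guide.
SUPER_CARD_DEFAULTS = {"temperature": 1.0, "top_p": 0.95}


def build_payload(model, prompt, thinking=True, max_tokens=64, stream=True):
    """Build a Chat Completions request body using the borrowed defaults."""
    return {
        "model": model,  # copy the exact slug from the live model card
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": stream,
        **SUPER_CARD_DEFAULTS,
        # Assumed field name for passing chat-template kwargs in the request.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


def split_delta(chunk):
    """Return (reasoning_text, answer_text) for one parsed streamed chunk."""
    delta = chunk["choices"][0].get("delta", {})
    return delta.get("reasoning_content") or "", delta.get("content") or ""


if __name__ == "__main__":
    body = build_payload("nvidia/nemotron-3-ultra", "Say hello.", thinking=False)
    print(json.dumps(body, indent=2))
```

Keeping reasoning text and answer text separate in `split_delta` makes it straightforward to log or discard the reasoning stream when you compare thinking on versus off in your own evals.
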