{"slug": "nvidia-rubin-s-10x-cheaper-tokens-hide-a-footnote", "title": "Nvidia Rubin's 10x Cheaper Tokens Hide a Footnote", "summary": "Nvidia's Vera Rubin NVL72, announced at CES and detailed at GTC, promises up to 10x lower cost per token than Blackwell, but the headline figure depends on ideal conditions: mixture-of-experts models, long contexts, and full-rack scale. The 10x benchmark uses FP4 quantization on the Kimi-K2-Thinking MoE model, meaning dense models or short-context workloads see far smaller gains. Additionally, the rack's high power draw (up to 2,000W per GPU) shifts constraints from chip supply to power delivery and liquid cooling, potentially stranding the cheaper tokens in facilities that cannot support the infrastructure.", "body_md": "A single number is already loose in 2026 budget decks: up to 10x lower cost per token than Blackwell. That is Nvidia's headline for the Vera Rubin NVL72, launched at CES in January and detailed at GTC in March. Per Nvidia's newsroom and developer blog, the same rack also promises up to 5x greater inference performance and a 4x cut in the GPUs needed to train a mixture-of-experts model, all measured against the current Blackwell generation.\n\nIf you are signing a GPU commit this quarter, that 10x is quietly rewriting your plan whether you have read the footnotes or not. So read the footnotes.\n\nThe thing to internalize first has nothing to do with silicon. It is timing.\n\nThe 10x and the ship date run on two separate clocks, and they are not synchronized. The marketing clock started in January 2026, the moment the slide went up. The deployment clock, by Nvidia's own guidance, starts with first shipments in the second half of 2026 and widens toward broad availability into 2027. 
Most capacity mistakes I see this year come from reading the first clock and acting as if it were the second.\n\nCut your Blackwell order today on the strength of a January slide and you open a capacity hole in the exact window demand is climbing fastest. Bank the full 10x in your pricing model and you have promised finance a margin that depends on FP4 quantization, MoE routing, and a rack you cannot physically rack yet. Two different errors, same root cause: treating a benchmark as a purchase order.\n\nHere is the part the slide compresses. Per Tom's Hardware's CES coverage, the \"up to 10x lower cost per token\" is benchmarked on the Kimi-K2-Thinking MoE model at 32K input and 8K output tokens. Read that twice. It is a mixture-of-experts model, a long-context measurement, taken at full rack scale.\n\nA dense model does not see that multiplier. A short-context workload does not. A single node, pulled out of the 72-GPU fabric, does not. The 10x is a ceiling struck under near-ideal conditions, not a floor you inherit by buying the hardware. If your production traffic is dense models at 4K context, the honest planning number is a fraction of the headline, and you have to derive it yourself.\n\nThe efficiency story rides on one format. Nvidia's developer blog quotes 50 PFLOPS of NVFP4 inference per Rubin GPU and 35 PFLOPS of NVFP4 training, with the inference figure framed as 5x Blackwell. NVFP4 is four-bit. That is where the cheaper tokens come from.\n\nSo ask the uncomfortable question about your own stack. If you serve FP8 or BF16 today, and you have not validated four-bit accuracy on your actual models with your actual eval set, the 10x is not yours. The hardware exposes cheaper tokens. Your engineering has to go claim them, and quantization that holds accuracy on a benchmark MoE can quietly wreck a smaller fine-tuned model on your traffic. 
This is the work that gets skipped because it is unglamorous, and it is exactly the work that decides whether the budgeted number shows up.\n\nCheaper per token does not mean cheaper to house. The opposite, in fact.\n\nPer Nvidia and VideoCardz, a Vera Rubin NVL72 rack packs 72 Rubin GPUs (144 GPU dies) and 36 Vera CPUs, delivering up to 3.6 NVFP4 exaFLOPS of inference and 1.2 FP8 exaFLOPS of training. The Rubin GPU carries 336 billion transistors, roughly 1.6x Blackwell, on TSMC 3nm, with a per-chip TDP reported around 2,000W. Each GPU gets 288 GB of HBM4 at up to 22 TB/s.\n\nDo the rack-level arithmetic on that TDP and the second-order fact jumps out. If the reported figure of roughly 2,000W applies to each of the 72 GPUs, the GPUs alone account for about 144 kW per rack, before the CPUs, switches, and cooling overhead. The per-token cost falls while the per-rack power and cooling burden climbs. For anyone planning a colo footprint, the constraint quietly migrates from chip supply to power delivery and liquid cooling. The cheapest token in the world is stranded if your facility cannot land a high-density liquid-cooled rack, and a lot of existing data center space cannot without a capital project that takes longer to finish than the GPUs take to arrive.\n\nRubin is not a GPU you drop into last year's chassis. Nvidia's developer blog names six new chips in the platform: the Vera CPU (88 custom Olympus cores), the Rubin GPU, an NVLink 6 switch, ConnectX-9, the BlueField-4 DPU, and a Spectrum-6 Ethernet switch.\n\nA performance win that depends on co-designed networking and DPUs is a win that depends on you adopting more of the stack, and on that stack passing qualification in your environment. That is the quiet tax on the deployment clock. First silicon is one date. A fully qualified, networking-and-DPU-integrated rack running your serving software in production is a later one, and it is the date that actually governs when the cheaper tokens land in your P&L.\n\nI should argue against my own thesis here, because the strongest objection is real. Rubin being months out is only half the comparison. 
The other half is that Blackwell keeps getting faster while you wait, through software rather than new hardware: TensorRT-LLM and Dynamo serving gains. The marginal cost per token on B200 and B300 in mid-2026 is not frozen at last year's figure.\n\nSo the decision is not \"expensive Blackwell now versus cheap Rubin later.\" It is \"improving Blackwell I can deploy this quarter versus a bigger step I cannot rack until 2027.\" Framed that way, waiting looks a lot less obvious.\n\nOne more figure to handle carefully. Analyst write-ups have floated roughly $0.02 to $0.03 per million tokens for dense inference on Rubin. That is a third-party extrapolation that folds in its own utilization and quantization assumptions. It is not an Nvidia list price, and it does not belong in a P&L as a quoted number.\n\nConcrete moves, each tied to a number above:\n\nThe 10x is real. It is just on a clock you don't control, in a format you haven't validated, in a rack your facility may not be able to power. Plan to the clock you can control.", "url": "https://wpnews.pro/news/nvidia-rubin-s-10x-cheaper-tokens-hide-a-footnote", "canonical_source": "https://dev.to/indra_gustiprasetya_a80a/nvidia-rubins-10x-cheaper-tokens-hide-a-footnote-4362", "published_at": "2026-06-16 12:01:28+00:00", "updated_at": "2026-06-16 12:17:44.109486+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-infrastructure", "ai-chips", "ai-products"], "entities": ["Nvidia", "Vera Rubin NVL72", "Blackwell", "Tom's Hardware", "VideoCardz", "Kimi-K2-Thinking", "TSMC", "HBM4"], "alternates": {"html": "https://wpnews.pro/news/nvidia-rubin-s-10x-cheaper-tokens-hide-a-footnote", "markdown": "https://wpnews.pro/news/nvidia-rubin-s-10x-cheaper-tokens-hide-a-footnote.md", "text": "https://wpnews.pro/news/nvidia-rubin-s-10x-cheaper-tokens-hide-a-footnote.txt", "jsonld": "https://wpnews.pro/news/nvidia-rubin-s-10x-cheaper-tokens-hide-a-footnote.jsonld"}}