Xiaomi MiMo Hits 1,000 Tokens-Per-Second Inference

[ARTICLE · art-25732] src=letsdatascience.com ↗ pub=2026-06-12T23:16Z topic=large-language-models verified=true sentiment=↑ positive

Xiaomi's MiMo-V2.5-Pro-UltraSpeed, a 1.02-trillion-parameter MoE model, reached 1,000 tokens per second of inference throughput on standard cloud GPUs using FP4 quantization, DFlash speculative decoding, and the TileRT runtime. Xiaomi describes it as the first trillion-parameter model to exceed that rate on commodity hardware, and an open-source checkpoint has been released for independent testing.

3 min read · published Jun 12, 2026

Memeburn reports that Xiaomi's MiMo-V2.5-Pro-UltraSpeed achieved 1,000 tokens per second of inference on standard, rentable cloud GPUs, and describes it as the first trillion-parameter model to exceed that rate. According to Memeburn, the flagship is a 1.02-trillion-parameter Mixture-of-Experts (MoE) model and reached the milestone by combining three techniques: FP4 expert-layer quantization, DFlash block-level speculative decoding, and the TileRT persistent GPU runtime. Memeburn also reports a limited API trial for enterprise and professional developers running June 9-23, 2026, and says the FP4-DFlash checkpoint has been open-sourced on Hugging Face for independent testing.

Editorial analysis: Industry observers note that such runtime and quantization gains commonly reduce per-token cost and broaden access to high-throughput LLM inference on commodity GPUs.

What happened

Xiaomi, in collaboration with inference partner TileRT, released MiMo-V2.5-Pro-UltraSpeed on June 8, 2026, claiming it is the first trillion-parameter model to exceed 1,000 tokens per second on a standard 8-GPU commodity node, with peak throughput near 1,200 tokens per second. The base model is unchanged: MiMo-V2.5-Pro, a 1.02-trillion-parameter Mixture-of-Experts (MoE) architecture, was released April 22, 2026, and UltraSpeed is a high-speed serving mode layered on top of it.
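For a rough sense of why 4-bit expert weights matter on an 8-GPU node, the Python sketch below estimates weight memory for a 1.02-trillion-parameter model. It rests on assumptions the article does not supply: the 95% expert share is a hypothetical placeholder, the non-expert modules are assumed to stay at 16 bits, and MXFP4 is costed at 4.25 bits per weight (4-bit elements plus one 8-bit scale shared by each 32-element block). It ignores KV cache, activations, and the draft model.

```python
TOTAL_PARAMS = 1.02e12          # parameter count reported in the article
MXFP4_BITS = 4 + 8 / 32         # 4-bit element + shared 8-bit scale per 32 weights
BF16_BITS = 16
NUM_GPUS = 8                    # node size reported in the article


def weight_footprint_gb(total_params, expert_fraction, expert_bits, other_bits):
    """Return approximate weight storage in GB (1 GB = 1e9 bytes)."""
    expert_bytes = total_params * expert_fraction * expert_bits / 8
    other_bytes = total_params * (1 - expert_fraction) * other_bits / 8
    return (expert_bytes + other_bytes) / 1e9


# HYPOTHETICAL: share of parameters living in MoE expert layers.
ASSUMED_EXPERT_FRACTION = 0.95

all_bf16 = weight_footprint_gb(TOTAL_PARAMS, ASSUMED_EXPERT_FRACTION, BF16_BITS, BF16_BITS)
fp4_experts = weight_footprint_gb(TOTAL_PARAMS, ASSUMED_EXPERT_FRACTION, MXFP4_BITS, BF16_BITS)

print(f"all 16-bit:   {all_bf16:.0f} GB total, {all_bf16 / NUM_GPUS:.0f} GB per GPU")
print(f"FP4 experts:  {fp4_experts:.0f} GB total, {fp4_experts / NUM_GPUS:.0f} GB per GPU")
```

Under these assumptions the 16-bit weights alone would exceed what an 8-GPU node of today's common accelerators can hold, while the FP4-experts variant is several times smaller. The exact figures shift with the real expert share.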

Technical approach

The throughput gains combine three co-designed layers.

First, FP4 (MXFP4) quantization is applied selectively to the MoE expert layers only. These modules hold most of the parameters and tolerate reduced precision best, while the other modules retain higher precision. Quantization-aware training keeps benchmark performance "essentially on par" with the original, per Xiaomi's own benchmarks.

Second, DFlash speculative decoding fills an entire block of masked positions in one forward pass rather than drafting token by token, using a sliding-window-attention draft model to keep per-step compute constant. In coding scenarios the average acceptance length reaches 6.30 per verification round, meaning 6-7 of 8 draft tokens survive.

Third, TileRT's persistent GPU kernel eliminates the per-operator launch overhead that fragments execution at microsecond scale, using warp specialization to overlap data movement and compute continuously.
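The acceptance-length figure is easier to read with the verification step written out. The Python sketch below shows generic greedy speculative-decoding verification, not DFlash itself: the function names are illustrative, and whether Xiaomi's 6.30 figure counts the extra token the target model contributes each round is not stated in the article.

```python
def verify_block(draft_tokens, target_tokens):
    """Greedy verification of one drafted block.

    draft_tokens:  tokens proposed by the draft model for the block.
    target_tokens: the target model's own greedy choice at each drafted
                   position, plus one extra token after the block, all
                   obtained from a single target forward pass.
    Returns the tokens actually emitted this round.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")

    emitted = []
    for drafted, expected in zip(draft_tokens, target_tokens):
        if drafted != expected:
            # First mismatch: keep the target's token and discard the rest of the draft.
            emitted.append(expected)
            return emitted
        emitted.append(drafted)
    # Every draft token matched, so the target's extra token comes for free.
    emitted.append(target_tokens[-1])
    return emitted


def mean_tokens_per_round(rounds):
    """Average number of tokens emitted per verification round."""
    return sum(len(verify_block(d, t)) for d, t in rounds) / len(rounds)


rounds = [
    ([5, 6, 7, 8], [5, 6, 9, 1, 2]),  # mismatch at the third position
    ([5, 6], [5, 6, 7]),              # whole draft accepted
]
print(mean_tokens_per_round(rounds))
```

Each round costs one pass through the large model regardless of how many tokens it emits, so a higher average per round translates directly into fewer expensive passes per generated token.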

Industry context

Achieving 1,000+ tokens per second (TPS) on a 1T-parameter model through software co-design on commodity GPUs, rather than on custom silicon such as Cerebras's wafer-scale chips or Groq's on-chip SRAM design, is significant for teams that cannot access specialized hardware. The MoE architecture keeps per-token active compute lower than in an equivalent dense model, which helped make the selective FP4-experts strategy tractable without broad quality regression.
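To make the FP4 step concrete, here is a simplified Python sketch of MXFP4-style block quantization as described in the OCP Microscaling format: 32 values share one power-of-two scale, and each value is rounded to a 4-bit E2M1 magnitude. It is an illustration only. It is not Xiaomi's quantization-aware training pipeline, it stores decoded values rather than packed 4-bit codes, and the function names are hypothetical.

```python
import math

# Magnitudes representable by the 4-bit E2M1 element format.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX_EXPONENT = 2   # largest magnitude 6.0 = 1.5 * 2**2
BLOCK_SIZE = 32         # elements sharing one scale in MXFP4


def quantize_block(block):
    """Return (scale, values): a power-of-two scale and E2M1 grid values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Align the largest magnitude in the block with E2M1's top exponent.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_MAX_EXPONENT)
    values = []
    for x in block:
        magnitude = min(abs(x) / scale, E2M1_GRID[-1])  # clamp to the format's max
        nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
        values.append(math.copysign(nearest, x))
    return scale, values


def dequantize_block(scale, values):
    """Reconstruct approximate weights from the shared scale and grid values."""
    return [scale * v for v in values]


weights = [0.1 * i for i in range(-16, 16)]  # one 32-element block of toy weights
scale, values = quantize_block(weights)
restored = dequantize_block(scale, values)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale}, worst absolute error={worst:.3f}")
```

With only eight magnitudes per sign, the rounding error is coarse, which is why the article's point about restricting FP4 to the layers that tolerate it, and training with quantization in the loop, matters.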

Access and open source

The UltraSpeed API trial runs June 9-23, 2026. It is application-based, limited to enterprises and professional developers, and priced at 3x the standard MiMo-V2.5-Pro rate for roughly 10x the generation speed. Xiaomi has open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face, including quantized weights and DFlash parameters, enabling independent community benchmarking.
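The 3x-price-for-10x-speed trade can be worked through with a few lines of Python. Only the two multipliers come from the article. The base price and base speed below are hypothetical placeholders, since the standard MiMo-V2.5-Pro rate is not given here.

```python
def compare_modes(n_tokens, base_price_per_token, base_tps,
                  price_multiplier=3.0, speed_multiplier=10.0):
    """Return (standard, ultraspeed) dicts with cost and generation time."""
    standard = {
        "cost": n_tokens * base_price_per_token,
        "seconds": n_tokens / base_tps,
    }
    ultraspeed = {
        "cost": standard["cost"] * price_multiplier,
        "seconds": standard["seconds"] / speed_multiplier,
    }
    return standard, ultraspeed


# HYPOTHETICAL inputs: 1M output tokens, 1e-6 currency units per token, 100 TPS.
standard, ultraspeed = compare_modes(1_000_000, 1e-6, 100.0)
print("standard:  ", standard)
print("ultraspeed:", ultraspeed)
```

The cost per token triples while the wait shrinks tenfold, so the mode pays off mainly where latency, not token spend, is the binding constraint.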

What to watch

Independent community benchmarks of the open-sourced FP4-DFlash checkpoint for quality regressions; replication of the 1,000 TPS figure on common GPU instances beyond the trial; acceptance-length performance in open-ended conversation scenarios where current results are weaker; and whether TileRT's persistent kernel approach ports to standard serving stacks.
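For anyone attempting the replication, a throughput check can be as simple as the Python sketch below. The `generate` callable is a hypothetical stand-in for whatever serving stack loads the checkpoint. The sketch times whole requests, so the result includes prompt processing and will read lower than a decode-only figure. How Xiaomi measured its 1,000 TPS number is not described in the article.

```python
import time


def measure_tps(generate, prompt, runs=3, clock=time.perf_counter):
    """Return generated tokens per second, aggregated over several runs.

    generate: callable taking a prompt and returning a list of generated tokens
              (HYPOTHETICAL interface; adapt to the serving stack in use).
    clock:    injectable timer, which makes the function testable.
    """
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = clock()
        tokens = generate(prompt)
        total_seconds += clock() - start
        total_tokens += len(tokens)
    return total_tokens / total_seconds


def fake_generate(prompt):
    """Stand-in model: sleeps briefly, then returns 50 placeholder tokens."""
    time.sleep(0.01)
    return ["tok"] * 50


print(f"{measure_tps(fake_generate, 'hello'):.0f} tokens/s (fake model)")
```

Reporting the prompt set, output lengths, batch size, and GPU type alongside the number makes results comparable across replications.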

Scoring rationale

A vendor-reported but well-documented inference milestone: 1,000+ TPS on a 1T MoE model using only commodity GPUs via FP4, DFlash, and a persistent runtime. Relevant to practitioners tracking inference cost and latency. Scored as notable rather than major given the limited, gated trial, vendor-only benchmarks pending independent replication, and weaker acceptance rates in open-ended conversation.


