[ARTICLE · art-25732] src=letsdatascience.com pub= topic=large-language-models verified=true sentiment=↑ positive

Xiaomi MiMo Hits 1,000 Tokens-Per-Second Inference

Xiaomi's MiMo-V2.5-Pro-UltraSpeed, a 1.02-trillion-parameter MoE model, achieved 1,000 tokens per second inference on standard cloud GPUs using FP4 quantization, DFlash speculative decoding, and the TileRT runtime. Xiaomi claims it is the first trillion-parameter model to exceed that rate on commodity hardware, and an open-source checkpoint has been released for independent testing.

3 min read · Published Jun 12, 2026

Memeburn reports that Xiaomi's MiMo-V2.5-Pro-UltraSpeed achieved 1,000 tokens per second inference on standard, rentable cloud GPUs, which the outlet describes as the first trillion-parameter model to exceed that rate. According to Memeburn, the flagship is a 1.02-trillion-parameter Mixture-of-Experts (MoE) model and reached the milestone using three combined techniques: FP4 expert-layer quantization, DFlash block-level speculative decoding, and the TileRT persistent GPU runtime. Memeburn also reports a limited API trial for enterprise and professional developers running June 9-23, 2026, and says the FP4-DFlash checkpoint has been open-sourced on Hugging Face for independent testing. Editorial analysis: Industry observers note that such runtime and quantization gains commonly reduce per-token cost and broaden access to high-throughput LLM inference on commodity GPUs.

What happened

Xiaomi, in collaboration with inference partner TileRT, released MiMo-V2.5-Pro-UltraSpeed on June 8, 2026, claiming it is the first trillion-parameter model to exceed 1,000 tokens per second on a standard 8-GPU commodity node, with peak throughput near 1,200 tokens per second. The base model, MiMo-V2.5-Pro, a 1.02-trillion-parameter Mixture-of-Experts (MoE) architecture released April 22, 2026, is unchanged; UltraSpeed is a high-speed serving mode layered on top.

Technical approach

The throughput gains come from three co-designed layers. First, FP4 (MXFP4) quantization is applied only to the MoE expert layers, which hold most of the parameters and tolerate reduced precision best, while other modules retain higher precision. Quantization-aware training keeps benchmark performance "essentially on par" with the original, per Xiaomi's own benchmarks. Second, DFlash speculative decoding fills an entire block of masked positions in one forward pass rather than token by token, using a sliding-window-attention draft model to keep per-step compute constant. In coding scenarios the average acceptance length reaches 6.30 per verification round, meaning that on average 6-7 of 8 draft tokens survive. Third, TileRT's persistent GPU kernel eliminates the per-operator launch overhead that fragments execution at microsecond scale, using warp specialization to overlap data movement and compute continuously.
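
As a rough illustration of what the reported acceptance length implies, the sketch below converts a throughput target into a time budget per verification round. It assumes the 1,000 tokens-per-second figure applies to a single generation stream and that each round emits the average accepted length, neither of which the source confirms. The function names are hypothetical, and the block size of 8 is inferred from the draft-token count quoted above.

```python
def round_budget_ms(target_tps: float, tokens_per_round: float) -> float:
    """Longest time (ms) one draft-plus-verify round may take while
    still sustaining target_tps on a single stream."""
    return 1000.0 * tokens_per_round / target_tps


def acceptance_fraction(accept_len: float, block_size: int) -> float:
    """Average share of drafted tokens that survive verification."""
    return accept_len / block_size


# Figures reported for coding workloads; single-stream use is an assumption.
spec_budget = round_budget_ms(1000, 6.30)   # block-level speculative decoding
plain_budget = round_budget_ms(1000, 1.0)   # one token per forward pass
kept = acceptance_fraction(6.30, 8)         # block size of 8 is inferred

print(f"speculative round budget: {spec_budget:.2f} ms")
print(f"token-by-token step budget: {plain_budget:.2f} ms")
print(f"average fraction of draft tokens kept: {kept:.3f}")
```

Under these assumptions, each round of drafting and verification would have about 6.3 ms to complete, compared with 1 ms per step for token-by-token decoding, which is why acceptance length matters alongside raw kernel speed.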

Industry context

Achieving 1,000+ TPS on a 1T model through software co-design on commodity GPUs, rather than custom silicon such as Cerebras's Wafer-Scale Engine or Groq's on-chip SRAM, matters most to teams that cannot access specialized hardware. The MoE architecture keeps per-token active compute lower than in equivalent dense models, which made the selective FP4-experts strategy tractable without broad quality regression.
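
For readers unfamiliar with the FP4 format behind the expert-layer strategy, the sketch below shows block-wise MXFP4 quantization as described in the Open Compute Project Microscaling (MX) specification: 32 elements share one power-of-two scale, and each element is stored as a 4-bit E2M1 value. It is a pure-Python illustration, not Xiaomi's implementation. The function names are hypothetical, ties are rounded toward the smaller magnitude for simplicity, and quantization-aware training is not shown.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # MX formats share one scale across a block of 32 elements
E2M1_EMAX = 2    # exponent of the largest E2M1 value (6.0 = 1.5 * 2**2)


def quantize_block(block):
    """Quantize one block (up to BLOCK_SIZE floats).

    Returns (shared_exponent, elements), where each element is a signed
    E2M1 value and the real value is element * 2**shared_exponent.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Pick the shared scale so the largest input lands near the top of E2M1.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elems = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        # Simplification: on a tie, min() keeps the smaller magnitude.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        elems.append(math.copysign(nearest, x))
    return shared_exp, elems


def dequantize_block(shared_exp, elems):
    """Reconstruct approximate floats from a quantized block."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elems]


# Hypothetical weights, used only to show the round trip.
weights = [0.02 * (i - 16) for i in range(BLOCK_SIZE)]
shared_exp, q = quantize_block(weights)
restored = dequantize_block(shared_exp, q)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"shared exponent: {shared_exp}, max abs error: {max_err:.4f}")
```

Because one scale covers a whole block, a single large weight coarsens the grid for its 31 neighbors, which is one reason a model may need quantization-aware training to stay close to its original benchmark scores.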

Access and open source

The UltraSpeed API trial runs June 9-23, 2026. It is application-based, limited to enterprises and professional developers, and priced at 3x the standard MiMo-V2.5-Pro rate for roughly 10x the generation speed. Xiaomi has open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face, including quantized weights and DFlash parameters, enabling independent community benchmarking.
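
The trade-off in the trial pricing can be made explicit with simple ratios. The sketch below assumes the "3x" figure is a per-token price multiplier and that "10x" applies to generation speed for the same output. The function name and dictionary keys are hypothetical.

```python
def ultraspeed_tradeoff(price_multiplier: float, speed_multiplier: float) -> dict:
    """Compare UltraSpeed with the standard mode for the same number of
    generated tokens, using only the two multipliers from the article."""
    return {
        # Same tokens, higher per-token price.
        "cost_ratio": price_multiplier,
        # Same tokens, generated faster.
        "time_ratio": 1.0 / speed_multiplier,
        # Spend per second of generation: more tokens per second, each pricier.
        "spend_rate_ratio": price_multiplier * speed_multiplier,
    }


ratios = ultraspeed_tradeoff(price_multiplier=3.0, speed_multiplier=10.0)
for name, value in ratios.items():
    print(f"{name}: {value:g}")
```

Under these assumptions, a job producing a fixed amount of output would cost about three times as much and finish in about a tenth of the time, which suggests the mode targets latency-sensitive workloads more than cost-sensitive batch jobs.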

What to watch

Independent community benchmarks of the open-sourced FP4-DFlash checkpoint for quality regressions; replication of the 1,000 TPS figure on common GPU instances beyond the trial; acceptance-length performance in open-ended conversation scenarios where current results are weaker; and whether TileRT's persistent kernel approach ports to standard serving stacks.

Scoring Rationale

A vendor-reported but well-documented inference milestone: 1,000+ TPS on a 1T MoE model using only commodity GPUs via FP4, DFlash, and a persistent runtime. Relevant to practitioners tracking inference cost and latency. Scored as notable rather than major given the limited, gated trial, vendor-only benchmarks pending independent replication, and weaker acceptance rates in open-ended conversation.
