Most people know Xiaomi for phones and scooters, not for breaking AI inference records. That changes today. Working with inference partner TileRT, Xiaomi has hit over 1,000 tokens per second on a 1-trillion-parameter model, the first time that barrier has been crossed at this scale, using nothing but a standard 8-GPU commodity server.
Summary
- 1,000+ tokens per second confirmed: Peak demos reached 1,200 tokens per second on the 1-trillion-parameter MiMo-V2.5-Pro model — a first at this scale without custom silicon.
- Three-layer engineering: FP4 quantization on expert layers, DFlash speculative decoding, and TileRT's persistent-core GPU runtime combine to cut latency at every stage.
- 10x faster, 3x the price: UltraSpeed API costs three times the standard MiMo-V2.5-Pro rate — but delivers roughly ten times the output speed.
- Limited trial June 9–23: Application-based access, enterprise and developer priority, two-week free Chat included with approval.
- Open-source checkpoint released: Xiaomi published the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face; TileRT open-sourced select modules on GitHub.
Why 1,000 Tokens Per Second Actually Matters
To understand why this is interesting, you need a reference point. Claude Opus 4.6 lands around 71 tokens per second, the lower-end Haiku model touches 98, and Gemini Flash hits 192. MiMo-V2.5-Pro in UltraSpeed mode runs at over 1,000. That's not a marginal improvement. It's a different category entirely.
At that speed, use cases that were previously off the table become viable. Real-time fraud detection, live trading signals, parallel reasoning chains, and multi-agent loops all have hard latency ceilings that standard inference speeds can't meet. At 1,000 tokens per second, they can.
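To put the throughput figures quoted above into wall-clock terms, here is a minimal Python sketch. The 2,000-token response length is a hypothetical chosen for illustration, and the calculation assumes a steady decode rate and ignores time to first token, network overhead, and queuing.

```python
# Throughput figures as quoted in this article (tokens per second).
# The UltraSpeed entry uses the 1,000 tok/s headline number, not the 1,200 peak.
THROUGHPUT_TOK_PER_S = {
    "Claude Opus 4.6": 71,
    "Haiku": 98,
    "Gemini Flash": 192,
    "MiMo-V2.5-Pro UltraSpeed": 1000,
}


def generation_seconds(num_tokens: int, tok_per_s: float) -> float:
    """Seconds needed to stream num_tokens at a steady decode rate."""
    return num_tokens / tok_per_s


def latency_table(num_tokens: int) -> dict[str, float]:
    """Map each model name to the time it needs for num_tokens output tokens."""
    return {
        name: generation_seconds(num_tokens, rate)
        for name, rate in THROUGHPUT_TOK_PER_S.items()
    }


if __name__ == "__main__":
    # Hypothetical 2,000-token response (an assumption, not from the article).
    for name, secs in latency_table(2000).items():
        print(f"{name:26s} {secs:6.2f} s")
```

Under these assumptions, the same response that takes close to half a minute at 71 tokens per second takes about two seconds at 1,000, which is the gap that matters for agent loops that chain many model calls.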
How They Did It
The speedup comes from three layers: FP4 quantization, DFlash speculative decoding, and the TileRT runtime. FP4 is applied only to the MoE expert layers, with quantization-aware training (QAT) keeping capability essentially on par. DFlash predicts a whole masked block of tokens per forward pass, reaching an average acceptance length of 6.30 on coding tasks. The TileRT runtime then restructures GPU execution with persistent cores and heterogeneous pipelines, removing the delay from operator switching and keeping the hardware busy throughout.
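The idea of an acceptance length can be illustrated with a generic block-verification loop. This is a toy sketch of greedy speculative decoding in general, not DFlash itself: the function names, the toy draft and target models, and the convention of counting the target's own token in each step are all assumptions made here, and the article does not say how Xiaomi counts its 6.30 figure.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]            # greedy target model (toy)
DraftBlock = Callable[[List[Token], int], List[Token]]  # block drafter (toy)


def verify_block(context: List[Token], draft: List[Token],
                 target_next: NextToken) -> List[Token]:
    """Tokens committed in one verification step.

    Keeps the longest draft prefix the target agrees with, then adds one token
    from the target: the correction at the first mismatch, or a bonus token if
    the whole block matched. A real system scores all positions in a single
    target forward pass; this loop calls the target per position for clarity.
    """
    committed: List[Token] = []
    for proposed in draft:
        expected = target_next(context + committed)
        if proposed != expected:
            committed.append(expected)
            return committed
        committed.append(proposed)
    committed.append(target_next(context + committed))
    return committed


def generate(prompt: List[Token], draft_block: DraftBlock, target_next: NextToken,
             num_tokens: int, block_size: int) -> Tuple[List[Token], List[int]]:
    """Generate num_tokens tokens; also return tokens committed per step."""
    out = list(prompt)
    steps: List[int] = []
    while len(out) - len(prompt) < num_tokens:
        committed = verify_block(out, draft_block(out, block_size), target_next)
        out.extend(committed)
        steps.append(len(committed))
    return out[len(prompt):len(prompt) + num_tokens], steps


def toy_target_next(context: List[Token]) -> Token:
    """Toy target model: counts upward modulo 100."""
    return (context[-1] + 1) % 100


def toy_draft_block(context: List[Token], block_size: int) -> List[Token]:
    """Toy drafter: same counting rule, but wrong on every multiple of 7."""
    block = []
    for i in range(1, block_size + 1):
        token = (context[-1] + i) % 100
        block.append(-1 if token % 7 == 0 else token)
    return block


if __name__ == "__main__":
    tokens, steps = generate([0], toy_draft_block, toy_target_next, 21, 4)
    print(tokens)                    # identical to what the target alone would emit
    print(sum(steps) / len(steps))   # average tokens committed per target step
```

The property worth noticing is that the output matches what the target model would have produced on its own; the drafter only changes how many target steps are needed. A higher average acceptance length means fewer target passes per generated token.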
The whole thing runs on a single standard 8-GPU node. No custom chips. No specialized hardware. That's the part that matters most — it means the barrier to deploying ultra-fast trillion-parameter inference drops significantly for any organization with standard GPU infrastructure.
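A rough back-of-the-envelope calculation shows why 4-bit expert weights matter for fitting a 1-trillion-parameter model on one 8-GPU node. The sketch below is an estimate under stated assumptions: the share of parameters that sit in MoE expert layers (`expert_fraction`) is a hypothetical value not given in the article, the remaining weights are assumed to stay at 16 bits, and quantization scales, KV cache, activations, and the draft model are all ignored.

```python
NUM_GPUS = 8           # single 8-GPU node, as stated in the article
TOTAL_PARAMS = 1e12    # 1 trillion parameters, as stated in the article
GB = 1e9               # decimal gigabytes


def weight_bytes(total_params: float, expert_fraction: float,
                 expert_bits: float = 4, other_bits: float = 16) -> float:
    """Bytes of weight storage when only expert layers use the low-bit format."""
    expert_params = total_params * expert_fraction
    other_params = total_params - expert_params
    return (expert_params * expert_bits + other_params * other_bits) / 8


if __name__ == "__main__":
    # Hypothetical: 95% of parameters live in MoE experts (not from the article).
    expert_fraction = 0.95

    all_16bit = weight_bytes(TOTAL_PARAMS, 0.0)
    mixed = weight_bytes(TOTAL_PARAMS, expert_fraction)

    print(f"16-bit everywhere: {all_16bit / GB:7.1f} GB total, "
          f"{all_16bit / GB / NUM_GPUS:6.1f} GB per GPU")
    print(f"FP4 experts only : {mixed / GB:7.1f} GB total, "
          f"{mixed / GB / NUM_GPUS:6.1f} GB per GPU")
```

Whatever the real expert share is, the bounds are fixed by arithmetic: 1 trillion parameters take 2,000 GB at 16 bits each and 500 GB at 4 bits each, so the mixed layout lands somewhere between those two totals.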
The Trial and the Caveats
Access is gated. The trial window runs June 9 to June 23, applications only, with enterprises and professional developers prioritized. Approved users get a two-week free Chat experience with usage guardrails: 10 queue entries per account daily, 30-minute session caps, and automatic release after 5 minutes idle. The Token Plan is not supported — API trial access only.
Independent third-party speed verification isn't public yet. Xiaomi's own numbers are the primary source. The open-source checkpoint on Hugging Face gives the community a path to verify the claims independently.