[ARTICLE · art-21868] src=dev.to pub= topic=large-language-models verified=true sentiment=↑ positive

Running Mixtral 8x7B at 21+ TPS on Pure CPU via io_uring and Predictive Caching

Amalgafy Labs has achieved 21+ tokens per second inference on the Mixtral 8x7B model using only CPU and SSD storage, bypassing the need for GPU VRAM. The team's Micro-Expert-Router (MER) system uses io_uring and predictive caching to stream expert weights from disk on demand, reaching a 97.46% cache hit rate across 10,000 expert lookups (5,000 tokens with top-2 routing). The open-source implementation runs on a standard virtual machine with commodity hardware, demonstrating that low-level systems engineering can make MoE inference viable without expensive GPU memory.

1 min read · published Jun 4, 2026

The current consensus in AI infrastructure is unyielding: if you want to run frontier Mixture of Experts (MoE) models at speeds that keep pace with human reading, you must pay the VRAM premium. The entire model footprint is traditionally pinned in high-bandwidth GPU memory to keep execution pipelines from grinding to a halt.

At Amalgafy Labs, we built the Micro-Expert-Router (MER) to challenge this assumption.

We wanted to prove that with low-level systems engineering, an intelligent software abstraction layer can turn cheap, abundant, commodity CPU-heavy cloud shapes into high-throughput inference engines.
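To make the idea of streaming experts on demand behind a predictive cache more concrete, here is a minimal Python sketch. It is not MER's code: the class name `PredictiveExpertCache`, the `load_expert` callback, and the choice of an LRU cache paired with a first-order "which expert usually follows this one" predictor are all assumptions made for illustration. The article does not describe MER's actual prediction or eviction policy.

```python
from collections import Counter, OrderedDict, defaultdict


class PredictiveExpertCache:
    """LRU cache of expert weights plus a first-order next-expert predictor.

    Illustrative only: this is an assumed design, not MER's implementation.
    """

    def __init__(self, capacity, load_expert):
        self.capacity = capacity
        self.load_expert = load_expert            # hypothetical: reads one expert from SSD
        self.slots = OrderedDict()                # expert_id -> weights, oldest first
        self.transitions = defaultdict(Counter)   # previous expert -> counts of next expert
        self.prev = None
        self.hits = self.misses = self.prefetches = 0

    def _insert(self, expert_id):
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)        # evict the least recently used expert
        self.slots[expert_id] = self.load_expert(expert_id)

    def get(self, expert_id):
        """Return the weights for expert_id, loading them on a miss."""
        if expert_id in self.slots:
            self.hits += 1
            self.slots.move_to_end(expert_id)     # mark as most recently used
        else:
            self.misses += 1
            self._insert(expert_id)
        if self.prev is not None:
            self.transitions[self.prev][expert_id] += 1   # record the observation
        self.prev = expert_id
        weights = self.slots[expert_id]
        self._prefetch_after(expert_id)
        return weights

    def _prefetch_after(self, expert_id):
        """Load the expert that most often followed expert_id, if it is absent."""
        seen = self.transitions[expert_id]
        if not seen:
            return
        guess, _count = seen.most_common(1)[0]
        if guess not in self.slots:
            self.prefetches += 1
            self._insert(guess)
```

With two slots and a repeating access pattern over three experts, plain LRU would miss on every lookup, whereas this sketch starts hitting once the pattern has been observed a single time. A real engine would issue the prefetch asynchronously so that the read overlaps with compute; here it is synchronous to keep the example short.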

Yesterday, we took the engine out of the "proven on paper" phase and validated it on live cloud silicon. Running Mixtral 8x7B (47B parameters, q4_0 quantization) on a standard virtual machine utilizing

The full source code is now open-source on GitHub: randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER.

The benchmark was executed inside an isolated virtual machine environment under strict compute constraints:

io_uring

O_DIRECT

dtype=q4_0

).12648430

).
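The io_uring and O_DIRECT entries above point at unbuffered, block-aligned reads of expert weights. As a rough sketch of what the O_DIRECT side involves, the Python below rounds a byte range out to block boundaries and reads it into a page-aligned buffer. Several things here are assumptions: the 4096-byte block size, the helper names `aligned_span` and `read_direct`, and the use of a synchronous `os.preadv` call in place of io_uring, which the Python standard library does not expose. It is Linux-only and is not taken from the MER source.

```python
import mmap
import os

BLOCK = 4096  # assumed logical block size; real devices may differ


def aligned_span(offset, length, block=BLOCK):
    """Round the byte range [offset, offset + length) outward to block boundaries.

    O_DIRECT generally requires the file offset, the transfer size and the
    buffer address to be multiples of the device's logical block size.
    """
    start = (offset // block) * block
    end = -(-(offset + length) // block) * block   # ceiling to the next boundary
    return start, end - start


def read_direct(path, offset, length):
    """Read `length` bytes at `offset`, bypassing the page cache (Linux only)."""
    start, span = aligned_span(offset, length)
    buf = mmap.mmap(-1, span)                      # anonymous mappings are page-aligned
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        got = os.preadv(fd, [buf], start)          # synchronous stand-in for io_uring
        skip = offset - start
        return bytes(buf[skip:min(skip + length, got)])
    finally:
        os.close(fd)
        buf.close()
```

One detail worth noting: the expert size reported in the run summary, 99090432 bytes, is an exact multiple of 4096, so under the assumed block size consecutive experts packed back to back would each start on a block boundary and need no padding.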

2026-06-04T15:10:41.520446Z  INFO stream complete wall_s=233.828879846 sustained_tps=21.383158501605987 avg_throughput_mibps=103.46455072587074 hit_rate_pct=97.46000000000001
2026-06-04T15:10:41.520511Z  INFO ===================== run summary =====================
2026-06-04T15:10:41.520519Z  INFO experts:       256 (top-2), cache=256 slots, pool=258 slots
2026-06-04T15:10:41.520522Z  INFO ffn shape:      d_model=4096  d_ff=14336  bytes/expert=99090432 (dtype=q4_0)
2026-06-04T15:10:41.520534Z  INFO lookups:       hits=9746  misses=254  hit_rate=97.46%
2026-06-04T15:10:41.520540Z  INFO prefetches:    completed=2  predictor_observations=19996
2026-06-04T15:10:41.520546Z  INFO i/o:           reads=254  bytes=24193.00 MiB
2026-06-04T15:10:41.520557Z  INFO i/o latency:   p50=116543us  p95=233599us  p99=360191us
2026-06-04T15:10:41.520563Z  INFO compute:       p50=40255us  p95=41631us  p99=60735us  (SwiGLU FFN per token)
2026-06-04T15:10:41.520569Z  INFO cycle latency: p50=40287us  p95=42047us  p99=286975us  max=431615us
2026-06-04T15:10:41.520576Z  INFO per-token avg: io_wait=5772.7us  compute=40850.5us  (over 5000 tokens)
2026-06-04T15:10:41.520582Z  INFO I/O share:     12.37% of token cycle time spent waiting on SSD reads
2026-06-04T15:10:41.520588Z  INFO energy knobs:  dtype=q4_0  partial_load_fraction=1.00  pinned=0  alias_redirects=0
2026-06-04T15:10:41.520595Z  INFO =======================================================
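Several figures in the run summary can be cross-checked with simple arithmetic. The Python sketch below recomputes three of them from values copied out of the log. The function names are made up for this example, and the q4_0 layout (blocks of 32 weights stored in 18 bytes, as in GGML) is an assumption about how MER packs weights, although it does reproduce the logged expert size exactly.

```python
# Values copied from the run summary.
D_MODEL, D_FF = 4096, 14336
WALL_S = 233.828879846
TOKENS = 5000
HITS, MISSES = 9746, 254

# Assumed GGML-style q4_0 layout: each block of 32 weights is stored as
# a 2-byte fp16 scale followed by 16 bytes of packed 4-bit values.
Q4_0_BLOCK_WEIGHTS = 32
Q4_0_BLOCK_BYTES = 18


def expert_bytes(d_model, d_ff):
    """Size of one SwiGLU FFN expert (gate, up and down projections) in q4_0."""
    params = 3 * d_model * d_ff
    return params // Q4_0_BLOCK_WEIGHTS * Q4_0_BLOCK_BYTES


def sustained_tps(tokens, wall_s):
    """Tokens generated per second of wall-clock time."""
    return tokens / wall_s


def hit_rate_pct(hits, misses):
    """Cache hit rate as a percentage of all expert lookups."""
    return 100.0 * hits / (hits + misses)


if __name__ == "__main__":
    print("bytes/expert :", expert_bytes(D_MODEL, D_FF))
    print("sustained_tps:", sustained_tps(TOKENS, WALL_S))
    print("hit_rate_pct :", hit_rate_pct(HITS, MISSES))
```

The checks also clarify how the counters relate: 9,746 hits plus 254 misses is 10,000 lookups, which is two lookups per token over 5,000 tokens, consistent with top-2 routing, and the sustained rate is computed over tokens rather than lookups.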