# Running Mixtral 8x7B at 21+ TPS on Pure CPU via io_uring and Predictive Caching

> Source: <https://dev.to/randyap8wq/running-mixtral-8x7b-at-21-tps-on-pure-cpu-via-iouring-and-predictive-caching-50cd>
> Published: 2026-06-04 19:21:38+00:00

The current consensus in AI infrastructure is unyielding: if you want to run frontier Mixture of Experts (MoE) models at speeds that keep pace with human reading, you must pay the VRAM premium. The entire model footprint is traditionally pinned in high-bandwidth GPU memory so that the execution pipeline never stalls waiting on weights.

At **Amalgafy Labs**, we built the **Micro-Expert-Router (MER)** to challenge this assumption.

We wanted to prove that with low-level systems engineering, an intelligent software abstraction layer can turn cheap, abundant, commodity CPU-heavy cloud shapes into high-throughput inference engines.

Yesterday, we took the engine out of the "proven on paper" phase and validated it on live cloud silicon. Running **Mixtral 8x7B (47B parameters, q4_0 quantization)** on a standard virtual machine, the engine sustained 21.38 tokens per second with a 97.46% expert-cache hit rate over 5,000 tokens.

The full source code is now open-source on GitHub: [randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER](https://github.com/randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER).

The benchmark was executed inside an isolated virtual machine environment under strict compute constraints:

- `io_uring`
- `O_DIRECT`
- `dtype=q4_0`
- `12648430`

```text
2026-06-04T15:10:41.520446Z  INFO stream complete wall_s=233.828879846 sustained_tps=21.383158501605987 avg_throughput_mibps=103.46455072587074 hit_rate_pct=97.46000000000001
2026-06-04T15:10:41.520511Z  INFO ===================== run summary =====================
2026-06-04T15:10:41.520519Z  INFO experts:       256 (top-2), cache=256 slots, pool=258 slots
2026-06-04T15:10:41.520522Z  INFO ffn shape:      d_model=4096  d_ff=14336  bytes/expert=99090432 (dtype=q4_0)
2026-06-04T15:10:41.520534Z  INFO lookups:       hits=9746  misses=254  hit_rate=97.46%
2026-06-04T15:10:41.520540Z  INFO prefetches:    completed=2  predictor_observations=19996
2026-06-04T15:10:41.520546Z  INFO i/o:           reads=254  bytes=24193.00 MiB
2026-06-04T15:10:41.520557Z  INFO i/o latency:   p50=116543us  p95=233599us  p99=360191us
2026-06-04T15:10:41.520563Z  INFO compute:       p50=40255us  p95=41631us  p99=60735us  (SwiGLU FFN per token)
2026-06-04T15:10:41.520569Z  INFO cycle latency: p50=40287us  p95=42047us  p99=286975us  max=431615us
2026-06-04T15:10:41.520576Z  INFO per-token avg: io_wait=5772.7us  compute=40850.5us  (over 5000 tokens)
2026-06-04T15:10:41.520582Z  INFO I/O share:     12.37% of token cycle time spent waiting on SSD reads
2026-06-04T15:10:41.520588Z  INFO energy knobs:  dtype=q4_0  partial_load_fraction=1.00  pinned=0  alias_redirects=0
2026-06-04T15:10:41.520595Z  INFO =======================================================
```
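The headline figures in the run summary can be cross-checked from the raw counters in the same log. The sketch below is a sanity check, not part of MER. It rests on two assumptions that the log does not state: that `q4_0` follows the ggml layout of 32 weights per 18-byte block (one fp16 scale plus 16 bytes of 4-bit values), and that each SwiGLU expert holds three `d_model x d_ff` matrices. All function and constant names are hypothetical.

```python
# Values copied from the run summary above.
D_MODEL = 4096
D_FF = 14336
HITS = 9746
MISSES = 254
TOKENS = 5000
WALL_S = 233.828879846
IO_MIB = 24193.00
IO_WAIT_US = 5772.7
COMPUTE_US = 40850.5

# Assumption: ggml-style q4_0 block = 32 weights in 18 bytes
# (2-byte fp16 scale + 16 bytes of packed 4-bit values).
Q4_0_BLOCK_WEIGHTS = 32
Q4_0_BLOCK_BYTES = 18


def expert_bytes(d_model: int, d_ff: int) -> int:
    """Size of one expert, assuming three d_model x d_ff matrices in q4_0."""
    weights = 3 * d_model * d_ff
    return weights // Q4_0_BLOCK_WEIGHTS * Q4_0_BLOCK_BYTES


def hit_rate_pct(hits: int, misses: int) -> float:
    """Cache hit rate as a percentage of all expert lookups."""
    return 100.0 * hits / (hits + misses)


def sustained_tps(tokens: int, wall_s: float) -> float:
    """Tokens per second over the whole run, including I/O stalls."""
    return tokens / wall_s


def throughput_mibps(mib: float, wall_s: float) -> float:
    """Average SSD read throughput over the whole run."""
    return mib / wall_s


def per_token_tps(io_wait_us: float, compute_us: float) -> float:
    """Token rate implied by the average per-token I/O wait plus compute."""
    return 1_000_000.0 / (io_wait_us + compute_us)


if __name__ == "__main__":
    print("bytes/expert:     ", expert_bytes(D_MODEL, D_FF))
    print("hit rate (%):     ", round(hit_rate_pct(HITS, MISSES), 2))
    print("sustained TPS:    ", round(sustained_tps(TOKENS, WALL_S), 3))
    print("throughput MiB/s: ", round(throughput_mibps(IO_MIB, WALL_S), 2))
    print("per-token TPS:    ", round(per_token_tps(IO_WAIT_US, COMPUTE_US), 2))
```

Under those assumptions the expert size works out to exactly the logged `bytes/expert=99090432`, and the hit rate, sustained TPS, and average throughput all follow from the logged counters. The rate implied by the per-token averages comes out slightly above the sustained figure, which is what you would expect if some wall time falls outside the measured token cycles.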


