Running Mixtral 8x7B at 21+ TPS on Pure CPU via io_uring and Predictive Caching

Amalgafy Labs has achieved 21+ tokens per second of inference on the Mixtral 8x7B model using only CPU and SSD storage, bypassing the need for GPU VRAM. The team's Micro-Expert-Router (MER) system uses io_uring and predictive caching to stream expert weights from disk on demand, achieving a 97.46% cache hit rate across 10,000 expert lookups (5,000 tokens with top-2 routing). The open-source implementation runs on standard virtual machines with commodity hardware, demonstrating that low-level systems engineering can enable viable MoE inference without expensive GPU memory.
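To make the idea of predictive caching concrete, here is a minimal sketch of a fixed-size expert cache that prefetches the experts most often seen after the current ones. It is an illustration under stated assumptions, not MER's actual implementation: the names `ExpertCache`, `load_fn`, and `run_trace` are hypothetical, the predictor is a simple first-order transition counter chosen for brevity, and loads happen synchronously rather than through io_uring.

```python
from collections import Counter, OrderedDict


class ExpertCache:
    """LRU cache of expert weights with a first-order successor predictor.

    Illustrative only: the real MER predictor and I/O path are not shown here.
    """

    def __init__(self, slots, load_fn):
        self.slots = slots
        self.load_fn = load_fn            # hypothetical: reads one expert from SSD
        self.resident = OrderedDict()     # expert_id -> weights, least recent first
        self.transitions = {}             # expert_id -> Counter of following experts
        self.hits = self.misses = self.prefetches = 0

    def _insert(self, expert_id):
        if len(self.resident) >= self.slots:
            self.resident.popitem(last=False)   # evict the least recently used slot
        self.resident[expert_id] = self.load_fn(expert_id)

    def lookup(self, expert_id):
        """Return the expert's weights, loading them on a miss."""
        if expert_id in self.resident:
            self.hits += 1
            self.resident.move_to_end(expert_id)
        else:
            self.misses += 1
            self._insert(expert_id)
        return self.resident[expert_id]

    def observe(self, prev_experts, next_experts):
        """Record which experts followed which on consecutive tokens."""
        for prev in prev_experts:
            counts = self.transitions.setdefault(prev, Counter())
            counts.update(next_experts)

    def prefetch_for(self, current_experts, top_k=2):
        """Load the top_k most frequent successors of each current expert."""
        for expert_id in current_experts:
            counts = self.transitions.get(expert_id)
            if not counts:
                continue
            for likely, _ in counts.most_common(top_k):
                if likely not in self.resident:
                    self._insert(likely)
                    self.prefetches += 1


def run_trace(trace, slots, use_predictor):
    """Replay a routing trace (one tuple of expert ids per token)."""
    cache = ExpertCache(slots, load_fn=lambda eid: f"weights-{eid}")
    prev = None
    for experts in trace:
        for expert_id in experts:
            cache.lookup(expert_id)
        if prev is not None:
            cache.observe(prev, experts)
        if use_predictor:
            cache.prefetch_for(experts)
        prev = experts
    return cache


if __name__ == "__main__":
    # Toy trace: 8 experts visited in a repeating cycle, top-2 per token.
    trace = [(t % 4, t % 4 + 4) for t in range(40)]
    for use_predictor in (False, True):
        c = run_trace(trace, slots=4, use_predictor=use_predictor)
        print(use_predictor, c.hits, c.misses, c.prefetches)
```

On this toy cyclic trace a plain 4-slot LRU cache thrashes and never hits, while the predictive variant misses only during the first pass through the cycle. Real routing is far less regular, so the sketch shows the mechanism rather than an expected hit rate.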

The current consensus in AI infrastructure is unyielding: if you want to run frontier Mixture of Experts (MoE) models at usable, human-reading inference speeds, you must pay the VRAM premium. The entire model footprint is traditionally pinned in high-bandwidth GPU memory to keep execution pipelines from grinding to a halt.

At Amalgafy Labs, we built the Micro-Expert-Router (MER) to challenge this assumption. We wanted to prove that, with low-level systems engineering, an intelligent software abstraction layer can turn cheap, abundant, CPU-heavy commodity cloud shapes into high-throughput inference engines. Yesterday, we took the engine out of the "proven on paper" phase and validated it on live cloud silicon, running Mixtral 8x7B (47B parameters, q4_0 quantization) on a standard virtual machine utilizing [...]

The full source code is now open source on GitHub: randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER (https://github.com/randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER).

The benchmark was executed inside an isolated virtual machine environment under strict compute constraints: io_uring, O_DIRECT, dtype=q4_0, 12648430.
2026-06-04T15:10:41.520446Z INFO stream complete wall_s=233.828879846 sustained_tps=21.383158501605987 avg_throughput_mibps=103.46455072587074 hit_rate_pct=97.46000000000001
2026-06-04T15:10:41.520511Z INFO ===================== run summary =====================
2026-06-04T15:10:41.520519Z INFO experts: 256 (top-2), cache=256 slots, pool=258 slots
2026-06-04T15:10:41.520522Z INFO ffn shape: d_model=4096 d_ff=14336 bytes/expert=99090432 dtype=q4_0
2026-06-04T15:10:41.520534Z INFO lookups: hits=9746 misses=254 hit_rate=97.46%
2026-06-04T15:10:41.520540Z INFO prefetches: completed=2 predictor_observations=19996
2026-06-04T15:10:41.520546Z INFO i/o: reads=254 bytes=24193.00 MiB
2026-06-04T15:10:41.520557Z INFO i/o latency: p50=116543us p95=233599us p99=360191us
2026-06-04T15:10:41.520563Z INFO compute: p50=40255us p95=41631us p99=60735us (SwiGLU FFN per token)
2026-06-04T15:10:41.520569Z INFO cycle latency: p50=40287us p95=42047us p99=286975us max=431615us
2026-06-04T15:10:41.520576Z INFO per-token avg: io_wait=5772.7us compute=40850.5us over 5000 tokens
2026-06-04T15:10:41.520582Z INFO I/O share: 12.37% of token cycle time spent waiting on SSD reads
2026-06-04T15:10:41.520588Z INFO energy knobs: dtype=q4_0 partial_load_fraction=1.00 pinned=0 alias_redirects=0
2026-06-04T15:10:41.520595Z INFO =======================================================
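Several figures in the run summary can be cross-checked against each other with a few lines of arithmetic. The sketch below copies its inputs from the log above. It assumes the GGML-style q4_0 layout (32 weights packed into an 18-byte block: 16 bytes of 4-bit values plus one fp16 scale) and a SwiGLU FFN with three projection matrices per expert. The function and constant names are illustrative and are not taken from the MER source.

```python
# Inputs copied from the run summary.
D_MODEL, D_FF = 4096, 14336
HITS, MISSES = 9746, 254
TOKENS = 5000
WALL_S = 233.828879846
IO_MIB = 24193.00

# Assumption: GGML-style q4_0 block of 32 weights stored in 18 bytes.
Q4_0_BLOCK_WEIGHTS = 32
Q4_0_BLOCK_BYTES = 18


def expert_bytes(d_model, d_ff):
    """Size of one SwiGLU expert (gate, up, and down projections) in q4_0."""
    weights = 3 * d_model * d_ff
    return weights // Q4_0_BLOCK_WEIGHTS * Q4_0_BLOCK_BYTES


def hit_rate_pct(hits, misses):
    """Cache hit rate as a percentage of all expert lookups."""
    return 100.0 * hits / (hits + misses)


def sustained_tps(tokens, wall_s):
    """Tokens generated per second of wall-clock time."""
    return tokens / wall_s


def avg_throughput_mibps(io_mib, wall_s):
    """Average SSD read throughput over the whole run."""
    return io_mib / wall_s


if __name__ == "__main__":
    size = expert_bytes(D_MODEL, D_FF)
    print(f"bytes/expert   = {size} ({size / 2**20} MiB)")
    print(f"lookups        = {HITS + MISSES}")
    print(f"hit rate       = {hit_rate_pct(HITS, MISSES):.2f}%")
    print(f"sustained tps  = {sustained_tps(TOKENS, WALL_S):.3f}")
    print(f"avg throughput = {avg_throughput_mibps(IO_MIB, WALL_S):.2f} MiB/s")
```

Under these assumptions the computed expert size matches the logged `bytes/expert=99090432` exactly, which works out to 94.5 MiB per expert read. The hit rate, tokens per second, and average throughput also agree with the logged values.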