{"slug": "running-mixtral-8x7b-at-21-tps-on-pure-cpu-via-io-uring-and-predictive-caching", "title": "Running Mixtral 8x7B at 21+ TPS on Pure CPU via io_uring and Predictive Caching", "summary": "Amalgafy Labs has achieved 21+ tokens per second inference on the Mixtral 8x7B model using only CPU and SSD storage, bypassing the need for GPU VRAM. The team's Micro-Expert-Router (MER) system leverages io_uring and predictive caching to stream expert weights from disk on demand, achieving a 97.46% cache hit rate across 10,000 expert lookups over 5,000 tokens. The open-source implementation runs on standard virtual machines with commodity hardware, demonstrating that low-level systems engineering can enable viable MoE inference without expensive GPU memory.", "body_md": "The current consensus in AI infrastructure is unyielding: if you want to run frontier Mixture of Experts (MoE) models at inference speeds that keep pace with human reading, you must pay the VRAM premium. The entire model footprint is traditionally pinned into high-bandwidth GPU memory arrays to prevent execution pipelines from grinding to a halt.\n\nAt **Amalgafy Labs**, we built the **Micro-Expert-Router (MER)** to challenge this assumption.\n\nWe wanted to prove that with low-level systems engineering, an intelligent software abstraction layer can turn cheap, abundant, commodity CPU-heavy cloud shapes into high-throughput inference engines.\n\nYesterday, we took the engine out of the \"proven on paper\" phase and validated it on live cloud silicon. 
Running **Mixtral 8x7B (47B parameters, q4_0 quantization)** on a standard virtual machine using only CPU and SSD storage, MER sustained 21.38 tokens per second with a 97.46% cache hit rate.\n\nThe full source code is now open source on GitHub: [randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER](https://github.com/randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER).\n\nThe benchmark was executed inside an isolated virtual machine environment under strict compute constraints:\n\n- `io_uring`\n- `O_DIRECT`\n- `dtype=q4_0`\n- `12648430`\n\n```text\n2026-06-04T15:10:41.520446Z  INFO stream complete wall_s=233.828879846 sustained_tps=21.383158501605987 avg_throughput_mibps=103.46455072587074 hit_rate_pct=97.46000000000001\n2026-06-04T15:10:41.520511Z  INFO ===================== run summary =====================\n2026-06-04T15:10:41.520519Z  INFO experts:       256 (top-2), cache=256 slots, pool=258 slots\n2026-06-04T15:10:41.520522Z  INFO ffn shape:      d_model=4096  d_ff=14336  bytes/expert=99090432 (dtype=q4_0)\n2026-06-04T15:10:41.520534Z  INFO lookups:       hits=9746  misses=254  hit_rate=97.46%\n2026-06-04T15:10:41.520540Z  INFO prefetches:    completed=2  predictor_observations=19996\n2026-06-04T15:10:41.520546Z  INFO i/o:           reads=254  bytes=24193.00 MiB\n2026-06-04T15:10:41.520557Z  INFO i/o latency:   p50=116543us  p95=233599us  p99=360191us\n2026-06-04T15:10:41.520563Z  INFO compute:       p50=40255us  p95=41631us  p99=60735us  (SwiGLU FFN per token)\n2026-06-04T15:10:41.520569Z  INFO cycle latency: p50=40287us  p95=42047us  p99=286975us  max=431615us\n2026-06-04T15:10:41.520576Z  INFO per-token avg: io_wait=5772.7us  compute=40850.5us  (over 5000 tokens)\n2026-06-04T15:10:41.520582Z  INFO I/O share:     12.37% of token cycle time spent waiting on SSD reads\n2026-06-04T15:10:41.520588Z  INFO energy knobs:  dtype=q4_0  partial_load_fraction=1.00  pinned=0  alias_redirects=0\n2026-06-04T15:10:41.520595Z  INFO =======================================================\n```\n\n", "url": 
"https://wpnews.pro/news/running-mixtral-8x7b-at-21-tps-on-pure-cpu-via-io-uring-and-predictive-caching", "canonical_source": "https://dev.to/randyap8wq/running-mixtral-8x7b-at-21-tps-on-pure-cpu-via-iouring-and-predictive-caching-50cd", "published_at": "2026-06-04 19:21:38+00:00", "updated_at": "2026-06-04 19:42:46.696948+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "machine-learning", "ai-research", "ai-tools"], "entities": ["Mixtral 8x7B", "Amalgafy Labs", "Micro-Expert-Router", "MER", "io_uring", "O_DIRECT", "GitHub", "Mixtral"], "alternates": {"html": "https://wpnews.pro/news/running-mixtral-8x7b-at-21-tps-on-pure-cpu-via-io-uring-and-predictive-caching", "markdown": "https://wpnews.pro/news/running-mixtral-8x7b-at-21-tps-on-pure-cpu-via-io-uring-and-predictive-caching.md", "text": "https://wpnews.pro/news/running-mixtral-8x7b-at-21-tps-on-pure-cpu-via-io-uring-and-predictive-caching.txt", "jsonld": "https://wpnews.pro/news/running-mixtral-8x7b-at-21-tps-on-pure-cpu-via-io-uring-and-predictive-caching.jsonld"}}