Running Mixtral 8x7B at 21+ TPS on Pure CPU via io_uring and Predictive Caching


[ARTICLE · art-21868] src=dev.to ↗ pub=2026-06-04T19:21Z topic=large-language-models verified=true sentiment=↑ positive


Amalgafy Labs reports 21+ tokens per second of inference on the Mixtral 8x7B model using only CPU and SSD storage, without holding the model in GPU VRAM. The team's Micro-Expert-Router (MER) system uses io_uring and predictive caching to stream expert weights from disk on demand, and logged a 97.46% cache hit rate across 10,000 expert lookups (5,000 tokens with top-2 routing). The open-source implementation runs on a standard virtual machine with commodity hardware, which the team presents as evidence that low-level systems engineering can make MoE inference viable without expensive GPU memory.

Published Jun 4, 2026 · 1 min read

The current consensus in AI infrastructure is unyielding: if you want to run frontier Mixture-of-Experts (MoE) models at speeds a human can comfortably read, you must pay the VRAM premium. Traditionally, the entire model footprint is pinned in high-bandwidth GPU memory to keep execution pipelines from grinding to a halt.

At Amalgafy Labs, we built the Micro-Expert-Router (MER) to challenge this assumption.

We wanted to prove that, with low-level systems engineering, an intelligent software abstraction layer can turn cheap, abundant, CPU-heavy commodity cloud shapes into high-throughput inference engines.

Yesterday, we took the engine out of the "proven on paper" phase and validated it on live cloud silicon. Running Mixtral 8x7B (47B parameters, q4_0 quantization) on a standard virtual machine utilizing

The full source code is now available on GitHub: randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER.
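The repository is the authoritative reference for how MER works. Purely as an illustration of the general idea of an expert cache with predictive prefetch, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `ExpertCache` class, the `load_fn` callback standing in for an SSD read, the LRU eviction policy, and the first-order "which expert usually follows this one" predictor are hypothetical and are not taken from the MER source. The sketch is also synchronous, whereas a real engine would issue the prefetch read asynchronously (for example through io_uring) so it overlaps with compute.

```python
from collections import Counter, OrderedDict, defaultdict


class ExpertCache:
    """Illustrative LRU cache of expert weights with a next-expert predictor."""

    def __init__(self, slots, load_fn):
        self.slots = slots
        self.load_fn = load_fn                    # hypothetical: reads one expert from SSD
        self.cache = OrderedDict()                # expert_id -> weights, oldest first
        self.transitions = defaultdict(Counter)   # prev expert -> counts of next expert
        self.prev = None
        self.hits = self.misses = self.prefetches = 0

    def _insert(self, expert_id):
        # Evict the least recently used expert when the cache is full.
        if len(self.cache) >= self.slots:
            self.cache.popitem(last=False)
        self.cache[expert_id] = self.load_fn(expert_id)

    def get(self, expert_id):
        if expert_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(expert_id)     # mark as most recently used
        else:
            self.misses += 1
            self._insert(expert_id)
        weights = self.cache[expert_id]
        # Record the observed transition so later predictions can use it.
        if self.prev is not None:
            self.transitions[self.prev][expert_id] += 1
        self.prev = expert_id
        self._prefetch_after(expert_id)
        return weights

    def _prefetch_after(self, expert_id):
        # Load the most frequent follower of this expert before it is requested.
        followers = self.transitions[expert_id]
        if not followers:
            return
        guess, _count = followers.most_common(1)[0]
        if guess not in self.cache:
            self.prefetches += 1
            self._insert(guess)


def fake_load(expert_id):
    # Stand-in for a disk read; returns a placeholder instead of real weights.
    return f"weights-{expert_id}"


cache = ExpertCache(slots=2, load_fn=fake_load)
for expert_id in [0, 1, 2] * 4:                   # cyclic access pattern
    cache.get(expert_id)
print(cache.hits, cache.misses, cache.prefetches)
```

On this toy cyclic pattern, a plain two-slot LRU cache would miss on every lookup, while the predictor learns the cycle after one pass and the remaining lookups hit. Real routing traces are far less regular, so the numbers here say nothing about MER's measured hit rate.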

The benchmark was executed inside an isolated virtual machine environment under strict compute constraints:

io_uring

O_DIRECT

dtype=q4_0

).12648430

2026-06-04T15:10:41.520446Z  INFO stream complete wall_s=233.828879846 sustained_tps=21.383158501605987 avg_throughput_mibps=103.46455072587074 hit_rate_pct=97.46000000000001
2026-06-04T15:10:41.520511Z  INFO ===================== run summary =====================
2026-06-04T15:10:41.520519Z  INFO experts:       256 (top-2), cache=256 slots, pool=258 slots
2026-06-04T15:10:41.520522Z  INFO ffn shape:      d_model=4096  d_ff=14336  bytes/expert=99090432 (dtype=q4_0)
2026-06-04T15:10:41.520534Z  INFO lookups:       hits=9746  misses=254  hit_rate=97.46%
2026-06-04T15:10:41.520540Z  INFO prefetches:    completed=2  predictor_observations=19996
2026-06-04T15:10:41.520546Z  INFO i/o:           reads=254  bytes=24193.00 MiB
2026-06-04T15:10:41.520557Z  INFO i/o latency:   p50=116543us  p95=233599us  p99=360191us
2026-06-04T15:10:41.520563Z  INFO compute:       p50=40255us  p95=41631us  p99=60735us  (SwiGLU FFN per token)
2026-06-04T15:10:41.520569Z  INFO cycle latency: p50=40287us  p95=42047us  p99=286975us  max=431615us
2026-06-04T15:10:41.520576Z  INFO per-token avg: io_wait=5772.7us  compute=40850.5us  (over 5000 tokens)
2026-06-04T15:10:41.520582Z  INFO I/O share:     12.37% of token cycle time spent waiting on SSD reads
2026-06-04T15:10:41.520588Z  INFO energy knobs:  dtype=q4_0  partial_load_fraction=1.00  pinned=0  alias_redirects=0
2026-06-04T15:10:41.520595Z  INFO =======================================================
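Several of the headline figures can be re-derived from the raw counters in the run summary. The short Python check below does that arithmetic. It assumes the GGML q4_0 layout of 18 bytes per block of 32 weights and three FFN weight matrices per SwiGLU expert (gate, up, down); the log does not state either explicitly. All input values are copied from the log lines above.

```python
# Values copied from the run summary log.
D_MODEL, D_FF = 4096, 14336
WALL_S = 233.828879846
TOKENS = 5000
HITS, MISSES = 9746, 254
IO_MIB = 24193.00
IO_WAIT_US, COMPUTE_US = 5772.7, 40850.5

# Assumptions (not stated in the log): q4_0 packs 32 weights into 18 bytes
# (16 bytes of 4-bit values plus a 2-byte fp16 scale), and each SwiGLU
# expert holds three d_model x d_ff matrices.
Q4_0_BLOCK_WEIGHTS, Q4_0_BLOCK_BYTES = 32, 18
MATRICES_PER_EXPERT = 3

weights_per_expert = MATRICES_PER_EXPERT * D_MODEL * D_FF
bytes_per_expert = weights_per_expert // Q4_0_BLOCK_WEIGHTS * Q4_0_BLOCK_BYTES
mib_per_expert = bytes_per_expert / 2**20

sustained_tps = TOKENS / WALL_S
hit_rate_pct = 100 * HITS / (HITS + MISSES)
lookups_per_token = (HITS + MISSES) / TOKENS
throughput_mibps = IO_MIB / WALL_S
io_share_pct = 100 * IO_WAIT_US / (IO_WAIT_US + COMPUTE_US)

print(f"bytes/expert      = {bytes_per_expert} ({mib_per_expert} MiB)")
print(f"sustained TPS     = {sustained_tps:.2f}")
print(f"hit rate          = {hit_rate_pct:.2f}%")
print(f"lookups per token = {lookups_per_token}")
print(f"avg throughput    = {throughput_mibps:.2f} MiB/s")
print(f"I/O share         = {io_share_pct:.2f}%")
```

Under those assumptions the derived bytes/expert equals the logged 99090432 exactly, and the throughput, TPS, and hit rate agree with the log. The 10,000 lookups work out to two per token, consistent with top-2 routing. The I/O share computed from the rounded per-token averages comes out near 12.38% against the logged 12.37%, a gap small enough to be rounding.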


Read original on dev.to → dev.to/randyap8wq/running-mixtral-8x7b-at-21-tps…
