The current consensus in AI infrastructure is unyielding: if you want to run frontier Mixture of Experts (MoE) models at inference speeds that keep pace with human reading, you must pay the VRAM premium. Traditionally, the entire model footprint is pinned in high-bandwidth GPU memory to keep execution pipelines from grinding to a halt.
At Amalgafy Labs, we built the Micro-Expert-Router (MER) to challenge this assumption.
We wanted to prove that, with low-level systems engineering, an intelligent software abstraction layer can turn cheap, abundant, CPU-heavy commodity cloud shapes into high-throughput inference engines.
Yesterday, we took the engine out of the "proven on paper" phase and validated it on live cloud silicon. Running Mixtral 8x7B (47B parameters, q4_0 quantization) on a standard virtual machine utilizing
The full source code is now available as open source on GitHub: randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER.
The benchmark was executed inside an isolated virtual machine environment under strict compute constraints:
io_uring
O_DIRECT
dtype=q4_0
12648430
2026-06-04T15:10:41.520446Z INFO stream complete wall_s=233.828879846 sustained_tps=21.383158501605987 avg_throughput_mibps=103.46455072587074 hit_rate_pct=97.46000000000001
2026-06-04T15:10:41.520511Z INFO ===================== run summary =====================
2026-06-04T15:10:41.520519Z INFO experts: 256 (top-2), cache=256 slots, pool=258 slots
2026-06-04T15:10:41.520522Z INFO ffn shape: d_model=4096 d_ff=14336 bytes/expert=99090432 (dtype=q4_0)
2026-06-04T15:10:41.520534Z INFO lookups: hits=9746 misses=254 hit_rate=97.46%
2026-06-04T15:10:41.520540Z INFO prefetches: completed=2 predictor_observations=19996
2026-06-04T15:10:41.520546Z INFO i/o: reads=254 bytes=24193.00 MiB
2026-06-04T15:10:41.520557Z INFO i/o latency: p50=116543us p95=233599us p99=360191us
2026-06-04T15:10:41.520563Z INFO compute: p50=40255us p95=41631us p99=60735us (SwiGLU FFN per token)
2026-06-04T15:10:41.520569Z INFO cycle latency: p50=40287us p95=42047us p99=286975us max=431615us
2026-06-04T15:10:41.520576Z INFO per-token avg: io_wait=5772.7us compute=40850.5us (over 5000 tokens)
2026-06-04T15:10:41.520582Z INFO I/O share: 12.37% of token cycle time spent waiting on SSD reads
2026-06-04T15:10:41.520588Z INFO energy knobs: dtype=q4_0 partial_load_fraction=1.00 pinned=0 alias_redirects=0
2026-06-04T15:10:41.520595Z INFO =======================================================
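The headline numbers in the run summary can be cross-checked from the raw counters in the same log. The sketch below is not part of MER; it is a small Python check that recomputes the derived figures. It makes two assumptions: that `q4_0` follows the GGML layout (32 weights per block, stored in 18 bytes) and that each SwiGLU expert holds three `d_model x d_ff` matrices. All function and constant names are illustrative.

```python
# Counters copied from the run summary log above.
HITS, MISSES = 9746, 254
TOKENS = 5000
WALL_S = 233.828879846
IO_WAIT_US, COMPUTE_US = 5772.7, 40850.5
IO_MIB = 24193.00
D_MODEL, D_FF = 4096, 14336
EXPERTS, PREFETCHES = 256, 2

# Assumption: GGML-style q4_0, i.e. 32 weights per block stored as
# 16 bytes of 4-bit values plus a 2-byte fp16 scale (18 bytes total).
Q4_0_BLOCK_WEIGHTS = 32
Q4_0_BLOCK_BYTES = 18
# Assumption: a SwiGLU FFN expert has gate, up, and down projections.
MATRICES_PER_EXPERT = 3


def bytes_per_expert(d_model: int, d_ff: int) -> int:
    """Size of one quantized expert under the assumptions above."""
    weights = MATRICES_PER_EXPERT * d_model * d_ff
    return weights // Q4_0_BLOCK_WEIGHTS * Q4_0_BLOCK_BYTES


def hit_rate_pct(hits: int, misses: int) -> float:
    """Cache hit rate as a percentage of all expert lookups."""
    return 100.0 * hits / (hits + misses)


def sustained_tps(tokens: int, wall_s: float) -> float:
    """Tokens per second over the whole run."""
    return tokens / wall_s


def io_share_pct(io_wait_us: float, compute_us: float) -> float:
    """Share of the average token cycle spent waiting on SSD reads."""
    return 100.0 * io_wait_us / (io_wait_us + compute_us)


def throughput_mibps(mib: float, wall_s: float) -> float:
    """Average read throughput in MiB/s over the whole run."""
    return mib / wall_s


if __name__ == "__main__":
    print("bytes/expert :", bytes_per_expert(D_MODEL, D_FF))
    print("hit rate %   :", round(hit_rate_pct(HITS, MISSES), 2))
    print("sustained tps:", round(sustained_tps(TOKENS, WALL_S), 3))
    print("I/O share %  :", round(io_share_pct(IO_WAIT_US, COMPUTE_US), 2))
    print("MiB/s        :", round(throughput_mibps(IO_MIB, WALL_S), 2))
    # Misses plus completed prefetches equal the expert count, which is
    # consistent with each expert being read from SSD once (not proof).
    print("cold loads   :", MISSES + PREFETCHES, "of", EXPERTS, "experts")
```

Under these assumptions, the expert size comes out to exactly the logged 99090432 bytes, and the hit rate, sustained tokens per second, and average throughput match the log. The I/O share computed from the rounded per-token averages lands near 12.4%, slightly off the logged 12.37%. The gap is presumably because the log works from unrounded totals.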