I built a Rust inference engine that streams MoE expert weights from NVMe SSDs, no GPU required

A developer built Micro-Expert-Router, a Rust inference engine that streams Mixture-of-Experts model weights directly from NVMe SSDs using io_uring with O_DIRECT, eliminating the need for GPU VRAM. The engine targets models like Mixtral 8x7B by loading only the 2 of 8 active experts needed per token step, reducing the active weight footprint from 24GB to approximately 6GB. The system exposes an OpenAI-compatible HTTP API with SSE streaming and includes multi-tier caching, quantization support, speculative decoding, and continuous batching, though performance projections of 11–15 tokens per second remain unverified due to lack of benchmark hardware.

Most people trying to run Mixtral or DeepSeek-V3 locally hit the same wall: they don't have 80GB of VRAM. The common answer is "get better hardware." I wanted to see if there was another way. https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F989tsu88kkq3xvlco6w7.png The idea is straightforward. Based on Apple’s landmark research paper, titled "LLM in a flash: Efficient Large Language Model Inference with Limited Memory" NVMe SSDs have gotten fast enough, PCIe Gen5 arrays are hitting ~56 GB/s, so you can treat them as a first-class memory tier for LLM inference instead of just storage. For Mixture-of-Experts models specifically, this is interesting because at any given token step, you only need 2 of 8 experts active. That's ~6GB of active weights on Mixtral 8x7B, not 24GB. Micro-Expert-Router is the result. It's a Rust inference engine that streams MoE expert weights directly from NVMe via io uring with O DIRECT, routes tokens through real SwiGLU FFN kernels, and exposes an OpenAI-compatible HTTP API with SSE streaming. What's in it: SSD-streamed expert loading via io uring fixed buffers and O DIRECT pread Multi-tier expert cache: SSD → RAM LRU with pinning → VRAM Q4 0, Q4K, Q8 0, F16 quantization with AVX2/AVX-512/AMX dispatch Speculative decoding with a draft engine tied to the main model embeddings Continuous batching with weighted round-robin admission SafeTensors loader, SIGHUP hot reload, TUI dashboard, Helm chart Honest disclaimer on the numbers: I don't have the hardware to run full benchmarks yet. The telemetry figures in the repo 11–15 tokens/sec across edge workstation, sovereign box, and RPC sharded cluster topologies are theoretical ceilings derived from active weight footprint and raw NVMe sequential bandwidth at 80% cache hit rate — not measured results. Cold I/O latency projections range from 108ms on a Quad Gen5 U.2 array down to 1010ms on a PCIe Gen4 M.2. The closest published prior art is Apple's LLM in a Flash paper, this is an attempt at an open source runnable implementation of that idea. The code is all there if you have the hardware to test it. I'd genuinely love to know if the projections hold. GitHub: Micro Expert Router https://github.com/randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER