{"slug": "i-built-a-rust-inference-engine-that-streams-moe-expert-weights-from-nvme-ssds", "title": "I built a Rust inference engine that streams MoE expert weights from NVMe SSDs, no GPU required", "summary": "A developer built Micro-Expert-Router, a Rust inference engine that streams Mixture-of-Experts model weights directly from NVMe SSDs using io_uring with O_DIRECT, eliminating the need for GPU VRAM. The engine targets models like Mixtral 8x7B by loading only the 2 of 8 active experts needed per token step, reducing the active weight footprint from 24GB to approximately 6GB. The system exposes an OpenAI-compatible HTTP API with SSE streaming and includes multi-tier caching, quantization support, speculative decoding, and continuous batching, though performance projections of 11–15 tokens per second remain unverified due to lack of benchmark hardware.", "body_md": "Most people trying to run Mixtral or DeepSeek-V3 locally hit the same wall: they don't have 80GB of VRAM. The common answer is \"get better hardware.\" I wanted to see if there was another way.\n\nThe idea is straightforward. It builds on Apple’s landmark research paper, \"LLM in a flash: Efficient Large Language Model Inference with Limited Memory\": NVMe SSDs have gotten fast enough (PCIe Gen5 arrays are hitting ~56 GB/s) that you can treat them as a first-class memory tier for LLM inference instead of just storage. For Mixture-of-Experts models specifically, this is interesting because at any given token step only 2 of 8 experts are active. That's ~6GB of active weights on Mixtral 8x7B, not 24GB.\n\nMicro-Expert-Router is the result. 
It's a Rust inference engine that streams MoE expert weights directly from NVMe via io_uring with O_DIRECT, routes tokens through real SwiGLU FFN kernels, and exposes an OpenAI-compatible HTTP API with SSE streaming.\n\nWhat's in it:\n\nSSD-streamed expert loading via io_uring fixed buffers and O_DIRECT pread\n\nMulti-tier expert cache: SSD → RAM (LRU with pinning) → VRAM\n\nQ4_0, Q4K, Q8_0, F16 quantization with AVX2/AVX-512/AMX dispatch\n\nSpeculative decoding with a draft engine tied to the main model embeddings\n\nContinuous batching with weighted round-robin admission\n\nSafeTensors loader, SIGHUP hot reload, TUI dashboard, Helm chart\n\nHonest disclaimer on the numbers:\n\nI don't have the hardware to run full benchmarks yet. The telemetry figures in the repo (11–15 tokens/sec across edge workstation, sovereign box, and RPC sharded cluster topologies) are theoretical ceilings derived from the active weight footprint and raw NVMe sequential bandwidth at an assumed 80% cache hit rate, not measured results. Cold I/O latency projections range from 108ms on a Quad Gen5 U.2 array to 1010ms on a PCIe Gen4 M.2. The closest published prior art is Apple's LLM in a Flash paper; this is an attempt at an open-source, runnable implementation of that idea.\n\nThe code is all there if you have the hardware to test it. 
I'd genuinely love to know if the projections hold.\n\nGitHub: [Micro Expert Router](https://github.com/randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER)", "url": "https://wpnews.pro/news/i-built-a-rust-inference-engine-that-streams-moe-expert-weights-from-nvme-ssds", "canonical_source": "https://dev.to/randyap8wq/i-built-a-rust-inference-engine-that-streams-moe-expert-weights-from-nvme-ssds-no-gpu-required-3bie", "published_at": "2026-05-27 03:32:21+00:00", "updated_at": "2026-05-27 03:52:48.678625+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "ai-infrastructure", "ai-research", "machine-learning"], "entities": ["Mixtral", "DeepSeek-V3", "Apple", "NVMe", "Micro-Expert-Router", "Rust", "SwiGLU", "AVX2"], "alternates": {"html": "https://wpnews.pro/news/i-built-a-rust-inference-engine-that-streams-moe-expert-weights-from-nvme-ssds", "markdown": "https://wpnews.pro/news/i-built-a-rust-inference-engine-that-streams-moe-expert-weights-from-nvme-ssds.md", "text": "https://wpnews.pro/news/i-built-a-rust-inference-engine-that-streams-moe-expert-weights-from-nvme-ssds.txt", "jsonld": "https://wpnews.pro/news/i-built-a-rust-inference-engine-that-streams-moe-expert-weights-from-nvme-ssds.jsonld"}}