cd /news/large-language-models/i-built-a-rust-inference-engine-that… · home topics large-language-models article
[ARTICLE · art-14825] src=dev.to pub= topic=large-language-models verified=true sentiment=↑ positive

I built a Rust inference engine that streams MoE expert weights from NVMe SSDs, no GPU required

A developer built Micro-Expert-Router, a Rust inference engine that streams Mixture-of-Experts model weights directly from NVMe SSDs using io_uring with O_DIRECT, eliminating the need for GPU VRAM. The engine targets models like Mixtral 8x7B by loading only the 2 of 8 active experts needed per token step, reducing the active weight footprint from 24GB to approximately 6GB. The system exposes an OpenAI-compatible HTTP API with SSE streaming and includes multi-tier caching, quantization support, speculative decoding, and continuous batching, though performance projections of 11–15 tokens per second remain unverified due to lack of benchmark hardware.

read2 min publishedMay 27, 2026

Most people trying to run Mixtral or DeepSeek-V3 locally hit the same wall: they don't have 80GB of VRAM. The common answer is "get better hardware." I wanted to see if there was another way.

The idea is straightforward. Based on Apple’s landmark research paper, titled "LLM in a flash: Efficient Large Language Model Inference with Limited Memory" NVMe SSDs have gotten fast enough, PCIe Gen5 arrays are hitting ~56 GB/s, so you can treat them as a first-class memory tier for LLM inference instead of just storage. For Mixture-of-Experts models specifically, this is interesting because at any given token step, you only need 2 of 8 experts active. That's ~6GB of active weights on Mixtral 8x7B, not 24GB.

Micro-Expert-Router is the result. It's a Rust inference engine that streams MoE expert weights directly from NVMe via io_uring with O_DIRECT, routes tokens through real SwiGLU FFN kernels, and exposes an OpenAI-compatible HTTP API with SSE streaming.

What's in it:

SSD-streamed expert via io_uring fixed buffers and O_DIRECT pread

Multi-tier expert cache: SSD → RAM (LRU with pinning) → VRAM Q4_0, Q4K, Q8_0, F16 quantization with AVX2/AVX-512/AMX dispatch

Speculative decoding with a draft engine tied to the main model embeddings

Continuous batching with weighted round-robin admission

SafeTensors , SIGHUP hot reload, TUI dashboard, Helm chart

Honest disclaimer on the numbers:

I don't have the hardware to run full benchmarks yet. The telemetry figures in the repo (11–15 tokens/sec across edge workstation, sovereign box, and RPC sharded cluster topologies) are theoretical ceilings derived from active weight footprint and raw NVMe sequential bandwidth at 80% cache hit rate — not measured results. Cold I/O latency projections range from 108ms on a Quad Gen5 U.2 array down to 1010ms on a PCIe Gen4 M.2. The closest published prior art is Apple's LLM in a Flash paper, this is an attempt at an open source runnable implementation of that idea.

The code is all there if you have the hardware to test it. I'd genuinely love to know if the projections hold.

GitHub: Micro Expert Router

── more in #large-language-models 4 stories · sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/i-built-a-rust-infer…] indexed:0 read:2min 2026-05-27 ·