I built a Rust inference engine that streams MoE expert weights from NVMe SSDs, no GPU required


[ARTICLE · art-14825] src=dev.to ↗ pub=2026-05-27T03:32Z topic=large-language-models verified=true sentiment=↑ positive


A developer built Micro-Expert-Router, a Rust inference engine that streams Mixture-of-Experts model weights directly from NVMe SSDs using io_uring with O_DIRECT, eliminating the need for GPU VRAM. The engine targets models like Mixtral 8x7B by loading only the 2 of 8 active experts needed per token step, reducing the active weight footprint from 24GB to approximately 6GB. The system exposes an OpenAI-compatible HTTP API with SSE streaming and includes multi-tier caching, quantization support, speculative decoding, and continuous batching, though performance projections of 11–15 tokens per second remain unverified due to lack of benchmark hardware.

Published May 27, 2026 · 2 min read

Most people trying to run Mixtral or DeepSeek-V3 locally hit the same wall: they don't have 80GB of VRAM. The common answer is "get better hardware." I wanted to see if there was another way.

The idea is straightforward, and it builds on Apple's research paper "LLM in a flash: Efficient Large Language Model Inference with Limited Memory". NVMe SSDs have gotten fast enough (PCIe Gen5 arrays are hitting ~56 GB/s) that you can treat them as a first-class memory tier for LLM inference instead of just storage. This is especially interesting for Mixture-of-Experts models, because at any given token step you only need 2 of 8 experts active. On Mixtral 8x7B that's ~6GB of active weights, not 24GB.
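To make the "2 of 8 experts" point concrete, here is a minimal sketch of top-k expert selection and the resulting footprint arithmetic. It is not taken from the Micro-Expert-Router repo: the function names `top_k_experts` and `active_footprint_gb` are hypothetical, the gate weights are assumed to be a softmax over the selected logits only, and the footprint calculation assumes the whole 24GB figure is split evenly across the 8 experts.

```rust
/// Pick the `k` highest-scoring experts for one token and normalize their
/// gate weights with a softmax over the selected logits only (an assumption).
fn top_k_experts(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut idx: Vec<usize> = (0..router_logits.len()).collect();
    // Stable sort by descending logit; ties keep the lower expert index first.
    idx.sort_by(|&a, &b| router_logits[b].total_cmp(&router_logits[a]));
    idx.truncate(k);

    // Subtract the max before exponentiating for numerical stability.
    let max = idx
        .iter()
        .map(|&i| router_logits[i])
        .fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = idx.iter().map(|&i| (router_logits[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();

    idx.into_iter().zip(exps).map(|(i, e)| (i, e / sum)).collect()
}

/// Weight footprint of the active experts, assuming the total is split
/// evenly across experts (shared layers are ignored in this simplification).
fn active_footprint_gb(total_expert_gb: f64, num_experts: usize, active: usize) -> f64 {
    total_expert_gb * active as f64 / num_experts as f64
}
```

Calling `active_footprint_gb(24.0, 8, 2)` gives the ~6GB figure quoted above; only the experts returned by `top_k_experts` would need to be resident for that token.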

Micro-Expert-Router is the result. It's a Rust inference engine that streams MoE expert weights directly from NVMe via io_uring with O_DIRECT, routes tokens through real SwiGLU FFN kernels, and exposes an OpenAI-compatible HTTP API with SSE streaming.
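For readers unfamiliar with SwiGLU, the following is a minimal, unoptimized sketch of what one expert's feed-forward pass computes: `down(silu(gate(x)) * up(x))`. It is an illustration only, not the engine's kernel; the `Expert` struct, its field names, and the row-major `f32` layout are assumptions, and there is no quantization or SIMD here.

```rust
/// SiLU (swish) activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Dense matrix-vector product for a row-major `rows x cols` matrix.
fn matvec(w: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    w.chunks_exact(cols)
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Hypothetical expert layout: three projection matrices in f32.
struct Expert {
    hidden: usize,    // model (input/output) dimension
    inter: usize,     // intermediate FFN dimension
    w_gate: Vec<f32>, // inter x hidden
    w_up: Vec<f32>,   // inter x hidden
    w_down: Vec<f32>, // hidden x inter
}

impl Expert {
    /// SwiGLU FFN: down( silu(gate(x)) * up(x) ), elementwise product in the middle.
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        let gate = matvec(&self.w_gate, self.inter, self.hidden, x);
        let up = matvec(&self.w_up, self.inter, self.hidden, x);
        let act: Vec<f32> = gate.iter().zip(&up).map(|(g, u)| silu(*g) * u).collect();
        matvec(&self.w_down, self.hidden, self.inter, &act)
    }
}
```

In an MoE layer, the outputs of the selected experts would then be combined using the router's gate weights.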

What's in it:

SSD-streamed expert weights via io_uring fixed buffers and O_DIRECT pread

Multi-tier expert cache: SSD → RAM (LRU with pinning) → VRAM

Q4_0, Q4K, Q8_0, F16 quantization with AVX2/AVX-512/AMX dispatch

Speculative decoding with a draft engine tied to the main model embeddings

Continuous batching with weighted round-robin admission

SafeTensors, SIGHUP hot reload, TUI dashboard, Helm chart
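As an illustration of the "LRU with pinning" RAM tier in the list above, here is a small sketch of such a cache. It is an assumption about how the idea could work, not the repo's implementation: the `ExpertCache` type, its methods, the `u32` expert ids, and a capacity counted in entries rather than bytes are all hypothetical, and eviction uses a simple linear scan.

```rust
use std::collections::HashMap;

struct Entry {
    bytes: Vec<u8>, // expert weights as loaded from disk
    last_used: u64, // logical timestamp of the most recent access
    pinned: bool,   // pinned entries are never evicted
}

/// Hypothetical RAM-tier cache keyed by expert id.
struct ExpertCache {
    capacity: usize, // maximum number of resident experts
    tick: u64,       // logical clock, bumped on every access
    entries: HashMap<u32, Entry>,
}

impl ExpertCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, tick: 0, entries: HashMap::new() }
    }

    /// Return the cached weights and mark the entry as most recently used.
    fn get(&mut self, id: u32) -> Option<&[u8]> {
        self.tick += 1;
        let entry = self.entries.get_mut(&id)?;
        entry.last_used = self.tick;
        Some(entry.bytes.as_slice())
    }

    /// Insert an expert, evicting the least recently used unpinned entry if
    /// the cache is full. Returns false when every resident entry is pinned.
    fn insert(&mut self, id: u32, bytes: Vec<u8>, pinned: bool) -> bool {
        self.tick += 1;
        if !self.entries.contains_key(&id) && self.entries.len() >= self.capacity {
            let victim = self
                .entries
                .iter()
                .filter(|(_, e)| !e.pinned)
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| *k);
            match victim {
                Some(k) => {
                    self.entries.remove(&k);
                }
                None => return false,
            }
        }
        self.entries.insert(id, Entry { bytes, last_used: self.tick, pinned });
        true
    }
}
```

On a miss, the caller would read the expert from SSD and `insert` it. Pinning is useful for experts that the router selects so often that evicting them would only cause repeated re-reads.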

Honest disclaimer on the numbers:

I don't have the hardware to run full benchmarks yet. The telemetry figures in the repo (11–15 tokens/sec across edge workstation, sovereign box, and RPC sharded cluster topologies) are theoretical ceilings derived from active weight footprint and raw NVMe sequential bandwidth at an 80% cache hit rate, not measured results. Cold I/O latency projections range from 108ms on a Quad Gen5 U.2 array up to 1010ms on a PCIe Gen4 M.2 drive. The closest published prior art is Apple's "LLM in a flash" paper; this is an attempt at an open-source, runnable implementation of that idea.
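The cold-read projection can be sanity-checked with back-of-envelope arithmetic: time equals bytes read divided by sequential bandwidth. The sketch below assumes the ~6GB active footprint and ~56 GB/s figures quoted earlier. The function names are hypothetical, and the hit-rate adjustment is a deliberately naive model of my own, not necessarily the one behind the repo's tokens/sec figures.

```rust
/// Milliseconds needed to read `active_gb` gigabytes at `bandwidth_gb_s` GB/s,
/// ignoring queueing, request overhead, and anything other than raw bandwidth.
fn cold_read_ms(active_gb: f64, bandwidth_gb_s: f64) -> f64 {
    active_gb / bandwidth_gb_s * 1000.0
}

/// Naive expected SSD time per token: only cache misses touch the SSD,
/// so scale the cold read time by the miss rate (1 - hit_rate).
fn expected_io_ms(active_gb: f64, bandwidth_gb_s: f64, hit_rate: f64) -> f64 {
    (1.0 - hit_rate) * cold_read_ms(active_gb, bandwidth_gb_s)
}
```

With these inputs, `cold_read_ms(6.0, 56.0)` comes out at roughly 107ms, in line with the 108ms Gen5 projection. Working backwards, the 1010ms Gen4 figure would correspond to a drive sustaining a little under 6 GB/s.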

The code is all there if you have the hardware to test it. I'd genuinely love to know if the projections hold.

GitHub: Micro Expert Router


Read original on dev.to → dev.to/randyap8wq/i-built-a-rust-inference-engin…
