"Saying it's impossible is not engineering. Saying we don't know how yet is science."
MoE-on-a-Potato is an experimental project dedicated to testing the extreme limits of running massive Mixture-of-Experts (MoE) Large Language Models on consumer-grade, budget hardware.
We successfully ran GLM-5.1 (a 754B parameter model, 176GB GGUF size) on a Ryzen 5 5600G (6 Cores / 12 Threads) CPU, Vega 7 iGPU, and 16GB DDR4 RAM without crashing, establishing a scientific proof of concept for low-memory MoE disk-streaming inference.
For a fully interactive performance breakdown, memory scales, and expert cache projections, open the built-in HTML dashboard: Interactive Phase 3 Dashboard
Our journey progressed through three scaling phases, measuring token generation speed, load times, and system peak RAM usage.
| Metric | Phase 1: DeepSeek (16B MoE) | Phase 2: GLM-4.7-Flash (30B MoE) | Phase 3: GLM-5.1 (754B MoE) |
|---|---|---|---|
| Active Parameters | 2.4B | 3B | 40B |
| Model GGUF Size | 10.4 GB | 18.5 GB | 176.0 GB |
| File Location | Main SSD | Main SSD | Secondary NVMe Partition |
| Model Load Time | 7.8 seconds | 29.0 seconds | 492.1 seconds (~8.2 min) |
| Context Warmup Time | Negligible | Negligible | 242.0 seconds (~4.0 min) |
| Prompt Processing Speed | 34.56 t/s | 2.71 t/s | 0.16 t/s (6.45 s/token) |
| Token Generation Speed | 25.15 t/s | 6.59 t/s | 0.05 t/s (20.89 s/token) |
| Peak System RAM Usage | 4.28 GB | 4.41 GB | 8.34 GB (Limit: 16 GB!) |
| Actual Model RAM Footprint | ~2.50 GB | ~3.00 GB | ~6.80 GB |
**Goal:** Build a custom `llama.cpp` compilation optimized for AVX2 and the Vulkan backend to offload processing to the integrated Radeon Vega 7 iGPU.

**Outcome:**
- By configuring memory mapping (`--mmap`), we kept the RAM footprint to **4.28 GB** (a **47% memory saving** over full-RAM loading).
- iGPU offload (`-ngl 10`) increased token generation speed to **25.15 t/s** (a **33.1% speedup** compared to CPU-only).
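The low footprint follows from how memory mapping works: the file is mapped into virtual address space, and the OS pages in only the regions that are actually touched. The sketch below illustrates the idea with Python's standard `mmap` module on a small temporary file. The file layout and the `read_slice_via_mmap` helper are hypothetical stand-ins for illustration, not llama.cpp's loader.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # OS page size, typically 4096 bytes


def read_slice_via_mmap(path, offset, length):
    """Map the whole file read-only and copy out a single slice.

    Only the pages backing the requested slice need to be resident;
    the rest of the file stays on disk.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return bytes(mm[offset:offset + length])


# Hypothetical stand-in for a model file: 64 pages, page i filled with byte i.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(64):
        tmp.write(bytes([i]) * PAGE)
    path = tmp.name

# Touch only 16 bytes from page 10 of the mapped file.
chunk = read_slice_via_mmap(path, 10 * PAGE, 16)
os.remove(path)
print(chunk[:4])
```

The same principle is what lets a file larger than physical RAM be mapped at all: the mapping reserves address space, not memory.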
**Goal:** Run a 30B model (18.5 GB) that physically exceeds the available free system RAM (~12-13 GB after OS/APU overhead).

**Outcome:**
- *Full RAM* was skipped due to a guaranteed OOM crash. Under `mmap`, the model booted successfully with **only 6.14 GB peak RAM**.
- iGPU offload (`-ngl 10`) pushed token generation to **6.59 t/s** while lowering the CPU-side RAM usage to **3.0 GB** (net **51% RAM saving**).
**Goal:** Run a colossal 754-billion parameter model (176 GB GGUF split files) on our 16 GB RAM potato PC.

**Storage Setup:** The weights were hosted on `/media/osman/CC46433D46432792`, which is an NTFS-formatted partition of our PCIe Gen3 NVMe SSD. Under Linux, mounting NTFS partitions via FUSE/ntfs-3g creates driver overhead, throttling sequential reads to **~650 MB/s** and adding CPU wait times.

**Outcome:**
- **0 Crashing / 0 OOM Errors:** The model initialized and mapped 176 GB of virtual memory with a maximum system RAM footprint of **only 8.34 GB**!
- **Inference Speeds:** Generated tokens at **0.05 t/s** (20.89 seconds per token) with prompt processing at **0.16 t/s**.
Our Phase 3 benchmark provided empirical proof that physical memory capacity is no longer the hard limit for local MoE execution; instead, the bottleneck is SSD read bandwidth.
During token generation, 8 routed experts (~40B active parameters) must be read from disk per token. In `IQ1_M` (1.6 bits/weight average), this equates to approximately 10 GB of weights per forward pass.
Reading 10 GB at ~650 MB/s takes about 15.4 seconds, against a measured 20.89 seconds per token. The remaining ~5.5-second delta represents the CPU compute overhead (AVX2 layer execution). This close match indicates that local MoE performance scales roughly in proportion to SSD throughput.
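As a back-of-envelope check, the arithmetic can be written out in a few lines. All inputs are the figures quoted in this README; the variable names are ours. Note that at the nominal 1.6 bits/weight, 40B active parameters come to about 8 GB, while the file-wide average implied by 176 GB over 754B parameters (about 1.87 bits/weight) gives about 9.3 GB, which is closer to the ~10 GB used above.

```python
# Figures quoted in this README; names are illustrative.
ACTIVE_PARAMS = 40e9          # active parameters per token
TOTAL_PARAMS = 754e9          # total model parameters
NOMINAL_BPW = 1.6             # quoted IQ1_M average bits/weight
FILE_SIZE_GB = 176.0          # GGUF size on disk
SSD_MB_PER_S = 650.0          # observed sequential read via ntfs-3g
QUOTED_READ_GB = 10.0         # per-token read used in the text
MEASURED_S_PER_TOKEN = 20.89  # measured generation latency


def weights_gb(params, bits_per_weight):
    """Size in GB (1e9 bytes) of `params` weights at a given bit width."""
    return params * bits_per_weight / 8 / 1e9


nominal_gb = weights_gb(ACTIVE_PARAMS, NOMINAL_BPW)
effective_bpw = FILE_SIZE_GB * 1e9 * 8 / TOTAL_PARAMS
effective_gb = weights_gb(ACTIVE_PARAMS, effective_bpw)

read_s = QUOTED_READ_GB * 1000 / SSD_MB_PER_S   # pure disk time per token
overhead_s = MEASURED_S_PER_TOKEN - read_s      # what is left for compute

print(f"nominal estimate:   {nominal_gb:.1f} GB/token")
print(f"file-average bpw:   {effective_bpw:.2f}")
print(f"effective estimate: {effective_gb:.1f} GB/token")
print(f"disk time:          {read_s:.2f} s, overhead: {overhead_s:.2f} s")
```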
To address the SSD I/O bottleneck, we proposed an Expert Cache/Pinning Layer architecture for `llama.cpp` and `GGML`.
Mixture-of-Experts activations follow a highly skewed Zipfian (power-law) distribution:
- Pinning just **12 out of 64 routed experts** (only ~18% of the model parameters) in fast memory (GPU VRAM or locked physical RAM via `mlock`) achieves a **73% cache hit rate**.
- This drops the average disk read per token from 10 GB to just 2.7 GB, projecting speeds to 0.24 t/s (~4.2 seconds/token).
- Pinning 24 experts achieves an **85% hit rate**, projecting speeds to **0.43 t/s (~2.3 seconds/token)**, making a 754B model genuinely usable for slow offline batch agentic tasks.
    ┌──────────────────────────────────────────┐
    │           Expert Cache (mlock)           │
    │        (12 Hot Experts - 73% Hit)        │
    └────────────────────┬─────────────────────┘
                         │
          ┌──────────────┴───────────────┐
          ▼                              ▼
    [Cache Hit (73% of tokens)]    [Cache Miss (27% of tokens)]
    Read from RAM/VRAM             Page-in from NVMe SSD
    Latency: ~0.1s                 Latency: ~15.3s
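The two paths in the diagram combine into an expected per-token latency, and the hit rate itself can be estimated from an expert popularity distribution. The sketch below is a toy model under stated assumptions: the Zipf exponent of 1.0 is a placeholder rather than a measured value, and the helper names are hypothetical. Real routing frequencies would come from profiling (for example, the output of `expert_profiler.py`).

```python
def zipf_probs(n_experts, exponent):
    """Normalized Zipf probabilities: p(rank) proportional to 1 / rank**exponent."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_experts + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def pinned_hit_rate(probs, n_pinned):
    """Share of expert activations served from cache when the
    n_pinned most popular experts are pinned in fast memory."""
    return sum(sorted(probs, reverse=True)[:n_pinned])


def expected_latency(hit_rate, hit_s=0.1, miss_s=15.3):
    """Blend the hit and miss latencies shown in the diagram."""
    return hit_rate * hit_s + (1.0 - hit_rate) * miss_s


# Placeholder exponent: an assumption for illustration, not a measurement.
probs = zipf_probs(64, exponent=1.0)
print(f"top-12 hit rate (toy Zipf): {pinned_hit_rate(probs, 12):.2f}")
print(f"top-24 hit rate (toy Zipf): {pinned_hit_rate(probs, 24):.2f}")

# Using the 73% hit rate quoted in the diagram: about 4.2 s/token.
print(f"expected latency at 73% hits: {expected_latency(0.73):.2f} s")
```

A steeper distribution (larger exponent) raises the hit rate for the same number of pinned experts, which is why the measured routing skew matters more than the cache size alone.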
Detailed implementation proposals, code structures, and GGML modifications can be found in the expert_cache_design.md file.
- `monitor.py`: Real-time logging of CPU, system RAM, and VRAM utilization.
- `expert_profiler.py`: Analyzes expert routing distributions under local execution.
- `expert_cache_design.md`: Architecture design and GGML code changes for MoE hot-expert pinning.
- `experiments/`:
  - `faz1_deepseek_v2_lite/`: Logs and benchmark summaries for the 16B model.
  - `faz2_glm_4.7_flash/`: Performance metrics and meta-conversation logs for the 30B model.
  - `faz3_glm_5.1_iq1_m/`: Metric reports, dashboard logs, and the HTML dashboard for the 754B model.
"Don't tell us it won't work. We'll build it, run it, and benchmark it."