I ran GLM-5.1 on a 16GB RAM machine

A team of engineers successfully ran the 754-billion parameter GLM-5.1 large language model on a consumer PC with only 16GB of RAM and a Ryzen 5 5600G CPU, achieving zero crashes or out-of-memory errors. The experiment, part of the MoE-on-a-Potato project, demonstrated that physical memory is no longer the hard limit for local Mixture-of-Experts inference, with the model using just 8.34GB of system RAM by streaming 176GB of weights from an NVMe SSD. The benchmark established that SSD read bandwidth, not RAM capacity, is now the primary bottleneck for running massive AI models on budget hardware.
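
The streaming behaviour described above rests on memory-mapped file access, which this README later configures via `--mmap`. As a minimal sketch of the idea, and not llama.cpp's actual loader, the Python snippet below maps a file read-only and copies out one small slice, so only the touched pages need to be brought into RAM. The demo file, its size, and the helper names `make_demo_file` and `read_slice` are assumptions for illustration.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # OS page size, typically 4096 bytes


def make_demo_file(n_pages: int = 256) -> str:
    """Write a small stand-in 'weights' file: page i is filled with byte value i % 256."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        for i in range(n_pages):
            f.write(bytes([i % 256]) * PAGE)
    return path


def read_slice(path: str, offset: int, length: int) -> bytes:
    """Map the whole file read-only and copy out one slice.

    Mapping reserves virtual address space only; the OS pages data in
    from disk when a region is actually accessed.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


path = make_demo_file()
chunk = read_slice(path, offset=10 * PAGE, length=16)  # touches a single page
print(len(chunk), chunk[0])
os.remove(path)
```

The same principle is what allows a 176 GB file to be mapped on a 16 GB machine: the mapping itself is virtual, and physical RAM is consumed only by the pages currently resident.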

"Saying it's impossible is not engineering. Saying we don't know how yet is science."

MoE-on-a-Potato is an experimental project dedicated to testing the extreme limits of running massive Mixture-of-Experts (MoE) large language models on consumer-grade, budget hardware. We successfully ran GLM-5.1 (a 754B-parameter model, 176 GB GGUF) on a Ryzen 5 5600G CPU (6 cores / 12 threads) with a Vega 7 iGPU and 16 GB of DDR4 RAM without crashing, establishing a scientific proof of concept for low-memory MoE disk-streaming inference.

For a fully interactive performance breakdown, memory scales, and expert cache projections, open the built-in HTML dashboard:

🔗 Interactive Phase 3 Dashboard: /snrj35-dev/754B-on-a-Potato/blob/main/experiments/faz3 glm 5.1 iq1 m/moe potato final dashboard.html

Our journey progressed through three scaling phases, measuring token generation speed, load times, and peak system RAM usage.

| Metric | Phase 1: DeepSeek 16B MoE | Phase 2: GLM-4.7-Flash 30B MoE | Phase 3: GLM-5.1 754B MoE 🚀 |
|---|---|---|---|
| Active Parameters | 2.4B | 3B | 40B |
| Model GGUF Size | 10.4 GB | 18.5 GB | 176.0 GB |
| File Location | Main SSD | Main SSD | Secondary NVMe Partition |
| Model Load Time | 7.8 seconds | 29.0 seconds | 492.1 seconds (~8.2 min) |
| Context Warmup Time | Negligible | Negligible | 242.0 seconds (~4.0 min) |
| Prompt Processing Speed | 34.56 t/s | 2.71 t/s | 0.16 t/s (6.45 s/token) |
| Token Generation Speed | 25.15 t/s | 6.59 t/s | 0.05 t/s (20.89 s/token) |
| Peak System RAM Usage | 4.28 GB | 4.41 GB | 8.34 GB (limit: 16 GB) |
| Actual Model RAM Footprint | ~2.50 GB | ~3.00 GB | ~6.80 GB |

Phase 1: DeepSeek 16B MoE

Goal: Build a custom llama.cpp compilation optimized for AVX2 and the Vulkan backend to offload processing to the integrated Radeon Vega 7 iGPU.

Outcome:
- By configuring memory mapping (`--mmap`), we kept the RAM footprint to 4.28 GB (a 47% memory saving over full-RAM loading).
- iGPU offload (`-ngl 10`) increased token generation speed to 25.15 t/s (a 33.1% speedup compared to CPU-only).

Phase 2: GLM-4.7-Flash 30B MoE

Goal: Run a 30B model (18.5 GB) that physically exceeds the available free system RAM (~12-13 GB after OS/APU overhead).

Outcome:
- Full RAM loading was skipped because it would have guaranteed an OOM crash. Under mmap, the model booted successfully with a peak of only 6.14 GB of RAM.
- iGPU offload (`-ngl 10`) pushed token generation to 6.59 t/s while lowering CPU-side RAM usage to 3.0 GB (a net 51% RAM saving).

Phase 3: GLM-5.1 754B MoE

Goal: Run a colossal 754-billion-parameter model (176 GB of split GGUF files) on our 16 GB RAM potato PC.

Storage setup: The weights were hosted on /media/osman/CC46433D46432792, an NTFS-formatted partition of our PCIe Gen3 NVMe SSD. Under Linux, mounting NTFS partitions via FUSE/ntfs-3g adds driver overhead, throttling sequential reads to ~650 MB/s and adding CPU wait times.

Outcome:
- 0 crashes / 0 OOM errors: The model initialized and mapped 176 GB of virtual memory with a maximum system RAM footprint of only 8.34 GB.
- Inference speeds: Tokens were generated at 0.05 t/s (20.89 seconds per token), with prompt processing at 0.16 t/s.

Our Phase 3 benchmark provided empirical evidence that physical memory capacity is no longer the hard limit for local MoE execution; instead, the bottleneck is SSD read bandwidth. During token generation, 8 routed experts (~40B active parameters) must be read from disk per token. In IQ1_M (1.6 bits/weight on average), this equates to approximately 10 GB of weights per forward pass. At the ~650 MB/s read rate of the NTFS mount, reading 10 GB takes about 15.4 seconds, against a measured 20.89 seconds per token. The remaining ~5.5-second delta represents the CPU compute overhead (AVX2 layer execution). This close match supports the conclusion that local MoE performance scales directly with SSD throughput.

To address the SSD I/O bottleneck, we proposed an Expert Cache/Pinning Layer architecture for llama.cpp and GGML. Mixture-of-Experts activations follow a highly skewed Zipfian (power-law) distribution:
- Pinning just 12 of the 64 routed experts (only ~18% of the model parameters) in fast memory (GPU VRAM, or physical RAM locked via `mlock`) achieves a 73% cache hit rate.
- This drops the average disk read per token from 10 GB to just 2.7 GB, projecting speeds of 0.24 t/s (~4.2 seconds/token).
- Pinning 24 experts achieves an 85% hit rate, projecting speeds of 0.43 t/s (~2.3 seconds/token), which would make a 754B model genuinely usable for slow, offline, batch agentic tasks.

The proposed cache flow:

    ┌──────────────────────────────────────────┐
    │           Expert Cache (mlock)           │
    │         12 Hot Experts - 73% Hit         │
    └────────────────────┬─────────────────────┘
                         │
      ┌──────────────────┴──────────────────┐
      ▼                                     ▼
    Cache Hit (73% of tokens)         Cache Miss (27% of tokens)
    Read from RAM/VRAM                Page-in from NVMe SSD
    Latency: ~0.1s                    Latency: ~15.3s

Detailed implementation proposals, code structures, and GGML modifications can be found in the expert cache design.md file (/snrj35-dev/754B-on-a-Potato/blob/main/expert cache design.md).

Repository contents:
- monitor.py: Real-time logging of CPU, system RAM, and VRAM utilization.
- expert profiler.py: Analyzes expert routing distributions under local execution.
- expert cache design.md: Architecture design and GGML code changes for MoE hot-expert pinning.
- experiments/:
  - faz1 deepseek v2 lite/: Logs and benchmark summaries for the 16B model.
  - faz2 glm 4.7 flash/: Performance metrics and meta-conversation logs for the 30B model.
  - faz3 glm 5.1 iq1 m/: Metric reports, dashboard logs, and the HTML dashboard for the 754B model.

“Don't tell us it won't work. We'll build it, run it, and benchmark it.” 🚀
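
As a rough check on the bandwidth arithmetic and expert-cache projections in this README, the sketch below recomputes seconds per token from the reported figures: ~10 GB of expert weights per token and ~650 MB/s of SSD read throughput. The linear model and the names `seconds_per_token` and `compute_s` are assumptions made for illustration, not code from the repository. With `compute_s` left at zero, the results line up with the projected 0.24 t/s and 0.43 t/s, which suggests those projections reflect the disk-read term alone.

```python
# Figures taken from the README text; the model itself is a simplification.
GB_PER_TOKEN = 10.0   # approx. expert weights read per token at IQ1_M
SSD_GB_PER_S = 0.65   # ~650 MB/s observed through the NTFS/ntfs-3g mount


def seconds_per_token(hit_rate: float, compute_s: float = 0.0) -> float:
    """Estimate time per token when `hit_rate` of expert reads come from RAM/VRAM.

    Only cache misses are charged against SSD bandwidth; `compute_s` is an
    optional fixed CPU cost per token (hypothetical parameter).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_gb = GB_PER_TOKEN * (1.0 - hit_rate)
    return miss_gb / SSD_GB_PER_S + compute_s


scenarios = [
    ("no cache", 0.0),
    ("12 experts pinned", 0.73),
    ("24 experts pinned", 0.85),
]

for label, hit in scenarios:
    s = seconds_per_token(hit)
    print(f"{label:>18}: {s:6.2f} s/token  ({1.0 / s:.3f} t/s)")
```

Passing `compute_s=5.5` adds the CPU overhead estimated in Phase 3, which is useful for seeing how quickly compute would become the dominant cost once most expert reads hit the cache.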