
I ran GLM-5.1 on a 16GB RAM machine

A team of engineers successfully ran the 754-billion parameter GLM-5.1 large language model on a consumer PC with only 16GB of RAM and a Ryzen 5 5600G CPU, achieving zero crashes or out-of-memory errors. The experiment, part of the MoE-on-a-Potato project, demonstrated that physical memory is no longer the hard limit for local Mixture-of-Experts inference, with the model using just 8.34GB of system RAM by streaming 176GB of weights from an NVMe SSD. The benchmark established that SSD read bandwidth, not RAM capacity, is now the primary bottleneck for running massive AI models on budget hardware.

4 min read · Published May 27, 2026

"Saying it's impossible is not engineering. Saying we don't know how yet is science."

MoE-on-a-Potato is an experimental project dedicated to testing the extreme limits of running massive Mixture-of-Experts (MoE) Large Language Models on consumer-grade, budget hardware.

We successfully ran GLM-5.1 (a 754B parameter model, 176GB GGUF size) on a Ryzen 5 5600G (6 Cores / 12 Threads) CPU, Vega 7 iGPU, and 16GB DDR4 RAM without crashing, establishing a scientific proof of concept for low-memory MoE disk-streaming inference.

For a fully interactive performance breakdown, memory scales, and expert cache projections, open the built-in HTML dashboard: 🔗 Interactive Phase 3 Dashboard

Our journey progressed through three scaling phases, measuring token generation speed, load times, and system peak RAM usage.

| Metric | Phase 1: DeepSeek (16B MoE) | Phase 2: GLM-4.7-Flash (30B MoE) | Phase 3: GLM-5.1 (754B MoE) 🚀 |
| --- | --- | --- | --- |
| Active Parameters | 2.4B | 3B | 40B |
| Model GGUF Size | 10.4 GB | 18.5 GB | 176.0 GB |
| File Location | Main SSD | Main SSD | Secondary NVMe Partition |
| Model Load Time | 7.8 seconds | 29.0 seconds | 492.1 seconds (~8.2 min) |
| Context Warmup Time | Negligible | Negligible | 242.0 seconds (~4.0 min) |
| Prompt Processing Speed | 34.56 t/s | 2.71 t/s | 0.16 t/s (6.45 s/token) |
| Token Generation Speed | 25.15 t/s | 6.59 t/s | 0.05 t/s (20.89 s/token) |
| Peak System RAM Usage | 4.28 GB | 4.41 GB | 8.34 GB (Limit: 16 GB!) |
| Actual Model RAM Footprint | ~2.50 GB | ~3.00 GB | ~6.80 GB |

Phase 1: DeepSeek (16B MoE)

Goal: Build a custom `llama.cpp` compilation optimized for AVX2 and the Vulkan backend to offload processing to the integrated Radeon Vega 7 iGPU.

Outcome:
- By configuring memory mapping (`--mmap`), we kept the RAM footprint to 4.28 GB (a **47% memory saving** over full-RAM loading).
- iGPU offload (`-ngl 10`) increased token generation speed to 25.15 t/s (a **33.1% speedup** compared to CPU-only).
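The effect of memory mapping can be illustrated outside `llama.cpp` with a minimal Python sketch. It is not the project's code: the temporary file is a stand-in for a weights file, and `read_slice` is a hypothetical helper. It shows that a mapped file can be sliced at an arbitrary offset, so only the pages actually touched need to be brought into RAM.

```python
import mmap
import os
import tempfile


def read_slice(path: str, offset: int, length: int) -> bytes:
    """Return `length` bytes at `offset` without reading the whole file."""
    with open(path, "rb") as f:
        # A length of 0 maps the entire file; the OS loads pages lazily on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


# Stand-in for a weights file: 64 MiB of zeros with a marker in the middle.
size = 64 * 1024 * 1024
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.truncate(size)
    tmp.seek(size // 2)
    tmp.write(b"expert-7")
    path = tmp.name

chunk = read_slice(path, size // 2, 8)
print(chunk)
os.remove(path)
```

Only the slice around the marker is read here. The same principle lets a model file far larger than RAM be addressed as virtual memory.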

Phase 2: GLM-4.7-Flash (30B MoE)

Goal: Run a 30B model (18.5 GB) that physically exceeds the available free system RAM (~12-13 GB after OS/APU overhead).

Outcome:
- The full-RAM run was skipped due to a guaranteed OOM crash. Under `mmap`, the model booted successfully with **only 6.14 GB peak RAM**.
- iGPU offload (`-ngl 10`) pushed token generation to 6.59 t/s while lowering the CPU-side RAM usage to 3.0 GB (a net **51% RAM saving**).

Phase 3: GLM-5.1 (754B MoE)

Goal: Run a colossal 754-billion-parameter model (176 GB GGUF split files) on our 16 GB RAM potato PC.

Storage Setup: The weights were hosted on `/media/osman/CC46433D46432792`, an NTFS-formatted partition of our PCIe Gen3 NVMe SSD. Under Linux, mounting NTFS partitions via FUSE/ntfs-3g creates driver overhead, throttling sequential reads to **~650 MB/s** and adding CPU wait times.

Outcome:
- **0 Crashes / 0 OOM Errors:** The model initialized and mapped 176 GB of virtual memory with a maximum system RAM footprint of **only 8.34 GB**!
- **Inference Speeds:** Generated tokens at **0.05 t/s** (20.89 seconds per token) with prompt processing at **0.16 t/s**.

Our Phase 3 benchmark provided empirical proof that physical memory capacity is no longer the hard limit for local MoE execution; instead, the bottleneck is SSD read bandwidth.

During token generation, 8 routed experts (~40B active parameters) must be read from disk per token. In `IQ1_M` (1.6 bits/weight average), this equates to approximately 10 GB of weights per forward pass.

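At the measured ~650 MB/s, streaming 10 GB takes about 15.4 seconds, against an observed 20.89 seconds per token. The remaining ~5.5-second delta represents the CPU compute overhead (AVX2 layer execution). This tight correlation indicates that local MoE performance scales almost directly with SSD throughput.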
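The arithmetic behind that delta can be reproduced from figures quoted in this article: roughly 10 GB read per token, ~650 MB/s sequential reads, and 20.89 seconds measured per token. The function name below is illustrative and the model is deliberately simple (disk time equals bytes divided by bandwidth).

```python
def io_seconds_per_token(gb_per_token: float, read_gb_per_s: float) -> float:
    """Lower bound on per-token latency if every byte must come from disk."""
    return gb_per_token / read_gb_per_s


GB_PER_TOKEN = 10.0        # approximate weights read per forward pass (from the article)
READ_GB_PER_S = 0.65       # ~650 MB/s through the NTFS/FUSE mount (from the article)
MEASURED_S_PER_TOKEN = 20.89

io_time = io_seconds_per_token(GB_PER_TOKEN, READ_GB_PER_S)
compute_overhead = MEASURED_S_PER_TOKEN - io_time

print(f"disk-bound time: {io_time:.2f} s/token")
print(f"remaining overhead: {compute_overhead:.2f} s/token")
```

Under these inputs the disk-bound term accounts for about three quarters of the measured latency.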

To address the SSD I/O bottleneck, we proposed an Expert Cache/Pinning Layer architecture for `llama.cpp` and GGML.

Mixture-of-Experts activations follow a highly skewed Zipfian power-law distribution:

- Pinning just 12 out of 64 routed experts (only ~18% of the model parameters) in fast memory (GPU VRAM or locked physical RAM via `mlock`) achieves a **73% cache hit rate**.
- This drops the average disk read per token from 10 GB to just 2.7 GB, projecting speeds to 0.24 t/s (~4.2 seconds/token).
- Pinning 24 experts achieves an **85% hit rate**, projecting speeds to **0.43 t/s (~2.3 seconds/token)**, making a 754B model genuinely usable for slow offline batch agentic tasks.
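These projections can be approximated with a short calculation. The sketch assumes, as the quoted numbers suggest, that projected speed counts only the disk time for cache misses and ignores CPU compute overhead. The function name and default arguments are illustrative, with values taken from the article.

```python
def projected_tokens_per_s(hit_rate: float,
                           full_read_gb: float = 10.0,
                           read_gb_per_s: float = 0.65) -> float:
    """Tokens per second if only cache misses are streamed from the SSD."""
    miss_gb = full_read_gb * (1.0 - hit_rate)   # bytes still read from disk per token
    return read_gb_per_s / miss_gb


for experts, hit_rate in [(12, 0.73), (24, 0.85)]:
    tps = projected_tokens_per_s(hit_rate)
    print(f"{experts} pinned experts: {tps:.2f} t/s ({1 / tps:.1f} s/token)")
```

With compute overhead included, real speeds would be lower than these disk-only projections.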

                  ┌──────────────────────────────────────────┐
                  │            Expert Cache (mlock)          │
                  │        (12 Hot Experts - 73% Hit)        │
                  └────────────────────┬─────────────────────┘
                                       │
                    ┌──────────────────┴──────────────────┐
                    ▼                                     ▼
      [Cache Hit (73% of tokens)]            [Cache Miss (27% of tokens)]
         Read from RAM/VRAM                      Page-in from NVMe SSD
            Latency: ~0.1s                          Latency: ~15.3s
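The hit/miss split in the diagram can be modelled with a small Python sketch. This is a conceptual illustration only, not the GGML design from `expert_cache_design.md`: the class name `PinnedExpertCache`, its fields, and the sample routing sequence are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PinnedExpertCache:
    """Tracks which expert lookups are served from pinned memory vs. disk."""
    pinned: frozenset          # IDs of experts held in locked RAM or VRAM
    hits: int = 0
    misses: int = 0

    def lookup(self, expert_id: int) -> str:
        """Return where the expert's weights would be read from."""
        if expert_id in self.pinned:
            self.hits += 1
            return "ram"
        self.misses += 1
        return "ssd"

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Pin experts 0..11 (12 of 64) and replay a made-up routing sequence.
cache = PinnedExpertCache(pinned=frozenset(range(12)))
sources = [cache.lookup(e) for e in [0, 3, 40, 5, 63, 1, 2, 11]]
print(sources, cache.hit_rate)
```

A real implementation would also have to choose which experts to pin and lock their pages in memory. This sketch covers only the bookkeeping.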

Detailed implementation proposals, code structures, and GGML modifications can be found in the expert_cache_design.md file.

- `monitor.py`: Real-time logging of CPU, system RAM, and VRAM utilization.
- `expert_profiler.py`: Analyzes expert routing distributions under local execution.
- `expert_cache_design.md`: Architecture design and GGML code changes for MoE hot-expert pinning.
- `experiments/`:
  - `faz1_deepseek_v2_lite/`: Logs and benchmark summaries for the 16B model.
  - `faz2_glm_4.7_flash/`: Performance metrics and meta-conversation logs for the 30B model.
  - `faz3_glm_5.1_iq1_m/`: Metric reports, dashboard logs, and the HTML dashboard for the 754B model.
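As a rough idea of what routing-distribution analysis involves, the sketch below computes how much of a routing trace the `k` most frequently selected experts cover. It is not the contents of `expert_profiler.py`: the function name, the trace format (a flat list of expert IDs), and the sample data are assumptions made for illustration.

```python
from collections import Counter


def top_k_coverage(trace: list[int], k: int) -> float:
    """Fraction of routing events handled by the k most-used experts."""
    if not trace:
        return 0.0
    counts = Counter(trace)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(trace)


# Made-up skewed trace: expert 0 is chosen far more often than the rest.
trace = [0] * 50 + [1] * 25 + [2] * 15 + [3] * 10
coverage = top_k_coverage(trace, 2)
print(f"top-2 experts cover {coverage:.0%} of routing events")
```

Run against a real trace, the same measure would estimate the cache hit rate for a given number of pinned experts.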

“Don't tell us it won't work. We'll build it, run it, and benchmark it.” 🚀
