# I ran GLM-5.1 on a 16GB RAM machine

> Source: <https://github.com/snrj35-dev/754B-on-a-Potato>
> Published: 2026-05-27 12:31:00+00:00

"Saying it's impossible is not engineering. Saying we don't know how yet is science."

**MoE-on-a-Potato** is an experimental project dedicated to testing the extreme limits of running massive Mixture-of-Experts (MoE) Large Language Models on consumer-grade, budget hardware.

We successfully ran **GLM-5.1** (a **754B parameter model**, 176GB GGUF size) on a **Ryzen 5 5600G (6 Cores / 12 Threads) CPU, Vega 7 iGPU, and 16GB DDR4 RAM** without crashing, establishing a scientific proof of concept for low-memory MoE disk-streaming inference.

For a fully interactive performance breakdown, memory scales, and expert cache projections, open the built-in HTML dashboard:
🔗 [Interactive Phase 3 Dashboard](/snrj35-dev/754B-on-a-Potato/blob/main/experiments/faz3_glm_5.1_iq1_m/moe_potato_final_dashboard.html)

Our journey progressed through three scaling phases, measuring token generation speed, load times, and system peak RAM usage.

| Metric | Phase 1: DeepSeek (16B MoE) | Phase 2: GLM-4.7-Flash (30B MoE) | Phase 3: GLM-5.1 (754B MoE) 🚀 |
|---|---|---|---|
| Active Parameters | 2.4B | 3B | 40B |
| Model GGUF Size | 10.4 GB | 18.5 GB | 176.0 GB |
| File Location | Main SSD | Main SSD | Secondary NVMe Partition |
| Model Load Time | 7.8 seconds | 29.0 seconds | 492.1 seconds (~8.2 min) |
| Context Warmup Time | Negligible | Negligible | 242.0 seconds (~4.0 min) |
| Prompt Processing Speed | 34.56 t/s | 2.71 t/s | 0.16 t/s (6.45 s/token) |
| Token Generation Speed | 25.15 t/s | 6.59 t/s | 0.05 t/s (20.89 s/token) |
| Peak System RAM Usage | 4.28 GB | 4.41 GB | 8.34 GB (Limit: 16 GB!) |
| Actual Model RAM Footprint | ~2.50 GB | ~3.00 GB | ~6.80 GB |

**Phase 1: DeepSeek (16B MoE)**

**Goal:** Build a custom `llama.cpp` compilation optimized for AVX2, with the Vulkan backend offloading processing to the integrated Radeon Vega 7 iGPU.

**Outcome:**

- By configuring memory mapping (`--mmap`), we kept the RAM footprint to **4.28 GB** (a **47% memory saving** over full-RAM loading).
- iGPU offload (`-ngl 10`) increased token generation speed to **25.15 t/s** (a **33.1% speedup** compared to CPU-only).

**Phase 2: GLM-4.7-Flash (30B MoE)**

**Goal:** Run a 30B model (18.5 GB) that physically exceeds the available free system RAM (~12-13 GB after OS/APU overhead).

**Outcome:**

- *Full RAM loading* was skipped because it would have guaranteed an OOM crash. Under `mmap`, the model booted successfully with **only 6.14 GB** peak RAM.
- iGPU offload (`-ngl 10`) pushed token generation to **6.59 t/s** while lowering the CPU-side RAM usage to **3.0 GB** (a net **51% RAM saving**).
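The low resident memory under `mmap` comes from the kernel paging file data in on demand instead of copying the whole file up front. A minimal Python sketch of that behavior follows. It uses a small temporary file as a stand-in for a GGUF file; the file layout, sizes, and the `read_slice_via_mmap` helper are illustrative assumptions, not the project's or `llama.cpp`'s actual loader.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

def read_slice_via_mmap(path: str, offset: int, length: int) -> bytes:
    """Map the file read-only and copy out only the requested byte range."""
    with open(path, "rb") as f:
        # A length of 0 maps the whole file; pages are only fetched when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]

# Stand-in "model file" (assumption): 64 pages, each filled with its page index.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for page_index in range(64):
        tmp.write(bytes([page_index]) * PAGE)
    model_path = tmp.name

# Touch one small region in the middle of the file, like reading a single expert.
chunk = read_slice_via_mmap(model_path, offset=40 * PAGE, length=16)
os.remove(model_path)
```

Only the pages backing the requested slice need to be resident, which is why a mapped file can be far larger than physical RAM.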

**Phase 3: GLM-5.1 (754B MoE)**

**Goal:** Run a colossal 754-billion-parameter model (176 GB of split GGUF files) on our 16 GB RAM potato PC.

**Storage Setup:** The weights were hosted on `/media/osman/CC46433D46432792`, which is an NTFS-formatted partition of our PCIe Gen3 NVMe SSD. Under Linux, mounting NTFS partitions via FUSE/ntfs-3g adds driver overhead, throttling sequential reads to **~650 MB/s** and adding CPU wait times.

**Outcome:**

- **0 Crashes / 0 OOM Errors:** The model initialized and mapped 176 GB of virtual memory with a maximum system RAM footprint of **only 8.34 GB**!
- **Inference Speeds:** Generated tokens at **0.05 t/s** (20.89 seconds per token) with prompt processing at **0.16 t/s**.
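A sequential-read figure like the ~650 MB/s above can be estimated with a simple timed read. The sketch below is one hedged way to do it in Python; `measure_read_throughput` is a hypothetical helper, not part of this repository, and the temporary file is only a stand-in for a model shard. On a real run the OS page cache should be cold, otherwise the result reflects RAM speed rather than disk speed.

```python
import os
import tempfile
import time

def measure_read_throughput(path: str, chunk_size: int = 8 * 1024 * 1024) -> float:
    """Read the file sequentially and return throughput in MB/s (10^6 bytes/s)."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:  # unbuffered at the Python level
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return total_bytes / 1e6 / max(elapsed, 1e-9)

# Demo on a 4 MB temporary file (assumption: stands in for a GGUF shard).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
    demo_path = tmp.name

throughput_mb_s = measure_read_throughput(demo_path)
os.remove(demo_path)
```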

Our Phase 3 benchmark provided **empirical proof** that physical memory capacity is no longer the hard limit for local MoE execution; instead, the bottleneck is **SSD read bandwidth**.

During token generation, 8 routed experts (~40B active parameters) must be read from disk per token. In `IQ1_M` (1.6 bits/weight average), this equates to approximately **10 GB of weights** per forward pass.

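Reading ~10 GB at ~650 MB/s takes roughly 15.4 seconds, against a measured 20.89 seconds per token. The remaining **~5.5-second delta** represents the CPU compute overhead (AVX2 layer execution). This tight correlation indicates that **local MoE performance scales in direct proportion to SSD throughput**.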
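The 5.5-second delta can be reproduced from the figures already given in this README (10 GB read per token, ~650 MB/s sequential reads, 20.89 s measured per token). A small worked calculation, with a hypothetical helper name and decimal units (1 GB = 1000 MB) assumed:

```python
def io_bound_seconds_per_token(gb_per_token: float, read_mb_per_s: float) -> float:
    """Time to stream one token's worth of weights from disk, ignoring compute."""
    return gb_per_token * 1000.0 / read_mb_per_s

disk_seconds = io_bound_seconds_per_token(10.0, 650.0)  # pure disk-read time
measured_seconds = 20.89                                # Phase 3 measurement
compute_delta = measured_seconds - disk_seconds         # attributed to CPU work

print(f"disk: {disk_seconds:.1f} s, delta: {compute_delta:.1f} s")
```

Under these assumptions roughly three quarters of each token's latency is spent waiting on the SSD.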

To address the SSD I/O bottleneck, we proposed an **Expert Cache/Pinning Layer** architecture for `llama.cpp` and `GGML`.

Mixture-of-Experts activations follow a highly skewed **Zipfian power-law distribution**:

- Pinning just **12 out of 64 routed experts** (only ~18% of the model parameters) in fast memory (GPU VRAM or locked physical RAM via `mlock`) achieves a **73% cache hit rate**.
- This drops the average disk read per token from 10 GB to just 2.7 GB, projecting speeds to **0.24 t/s (~4.2 seconds/token)**.
- Pinning **24 experts** achieves an **85% hit rate**, projecting speeds to **0.43 t/s (~2.3 seconds/token)**, making a 754B model genuinely usable for slow offline batch agentic tasks.
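The projected speeds above follow from a simple model: only cache misses hit the disk, so the per-token read shrinks by the hit rate. The sketch below reproduces the 0.24 t/s and 0.43 t/s figures under two simplifying assumptions that appear to match the README's numbers: cache hits cost no time, and CPU compute is ignored. The function name is hypothetical.

```python
def projected_tokens_per_second(
    hit_rate: float,
    gb_per_token: float = 10.0,    # weights read per token without a cache
    read_mb_per_s: float = 650.0,  # measured sequential read throughput
) -> float:
    """Project t/s when only cache misses are read from disk."""
    miss_gb = gb_per_token * (1.0 - hit_rate)
    seconds = miss_gb * 1000.0 / read_mb_per_s
    if seconds == 0:
        return float("inf")  # everything cached; the disk is no longer the limit
    return 1.0 / seconds

speed_12_experts = projected_tokens_per_second(0.73)  # 12 pinned experts
speed_24_experts = projected_tokens_per_second(0.85)  # 24 pinned experts

print(f"{speed_12_experts:.2f} t/s, {speed_24_experts:.2f} t/s")
```

Adding the ~5.5 s of measured compute time per token would lower both projections, so these numbers are best read as an upper bound on the I/O side.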

```
                  ┌──────────────────────────────────────────┐
                  │            Expert Cache (mlock)          │
                  │        (12 Hot Experts - 73% Hit)        │
                  └────────────────────┬─────────────────────┘
                                       │
                    ┌──────────────────┴──────────────────┐
                    ▼                                     ▼
      [Cache Hit (73% of tokens)]            [Cache Miss (27% of tokens)]
         Read from RAM/VRAM                      Page-in from NVMe SSD
            Latency: ~0.1s                          Latency: ~15.3s
```
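The hit/miss flow in the diagram can be sketched in a few lines of Python. This is only an illustration of the idea, not the GGML design from `expert_cache_design.md`: `PinnedExpertCache`, `fake_loader`, and the byte-string "weights" are all hypothetical, and a real implementation would pin memory with `mlock` rather than hold Python objects.

```python
from typing import Callable, Dict, Iterable

class PinnedExpertCache:
    """Keep a fixed set of hot experts in memory; load everything else on demand."""

    def __init__(self, hot_expert_ids: Iterable[int],
                 load_from_disk: Callable[[int], bytes]) -> None:
        self._load = load_from_disk
        # Hot experts are loaded once up front and never evicted.
        self._pinned: Dict[int, bytes] = {
            eid: load_from_disk(eid) for eid in hot_expert_ids
        }
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int) -> bytes:
        weights = self._pinned.get(expert_id)
        if weights is not None:
            self.hits += 1           # served from fast memory
            return weights
        self.misses += 1             # falls through to the slow disk path
        return self._load(expert_id)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

def fake_loader(expert_id: int) -> bytes:
    """Stand-in for paging an expert's weights in from NVMe (assumption)."""
    return bytes([expert_id]) * 4

cache = PinnedExpertCache(hot_expert_ids=range(12), load_from_disk=fake_loader)
for expert_id in [0, 1, 2, 63, 5, 40, 11, 3]:
    cache.get(expert_id)
```

A static pinned set is the simplest policy; it fits a skewed routing distribution because the same few experts stay hot across tokens.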

Detailed implementation proposals, code structures, and GGML modifications can be found in the [expert_cache_design.md](/snrj35-dev/754B-on-a-Potato/blob/main/expert_cache_design.md) file.

- `monitor.py`: Real-time logging of CPU, system RAM, and VRAM utilization.
- `expert_profiler.py`: Analyzes expert routing distributions under local execution.
- `expert_cache_design.md`: Architecture design and GGML code changes for MoE hot-expert pinning.
- `experiments/`:
  - `faz1_deepseek_v2_lite/`: Logs and benchmark summaries for the 16B model.
  - `faz2_glm_4.7_flash/`: Performance metrics and meta-conversation logs for the 30B model.
  - `faz3_glm_5.1_iq1_m/`: Metric reports, dashboard logs, and the HTML dashboard for the 754B model.
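As a rough idea of the kind of analysis a routing profiler performs, the sketch below counts expert activations in a routing trace and estimates the hit rate of pinning the most frequent experts. It is not the code in `expert_profiler.py`; the trace format (one list of routed expert IDs per token) and the function name are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple

def hot_experts_and_hit_rate(
    trace: Sequence[Sequence[int]], num_pinned: int
) -> Tuple[List[int], float]:
    """Return the most-activated experts and the share of activations they cover."""
    counts = Counter(eid for token_experts in trace for eid in token_experts)
    hot = [eid for eid, _ in counts.most_common(num_pinned)]
    total = sum(counts.values())
    hits = sum(counts[eid] for eid in hot)
    return hot, (hits / total if total else 0.0)

# Toy trace (assumption): 5 tokens, 2 routed experts each.
trace = [[0, 1], [0, 2], [0, 1], [3, 0], [1, 4]]
hot, rate = hot_experts_and_hit_rate(trace, num_pinned=2)
print(hot, rate)
```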

*“Don't tell us it won't work. We'll build it, run it, and benchmark it.”* 🚀
