I ran GLM-5.1 on a 16GB RAM machine


[ARTICLE · art-15249] src=github.com ↗ pub=2026-05-27T12:31Z topic=large-language-models verified=true sentiment=↑ positive


A team of engineers successfully ran the 754-billion parameter GLM-5.1 large language model on a consumer PC with only 16GB of RAM and a Ryzen 5 5600G CPU, achieving zero crashes or out-of-memory errors. The experiment, part of the MoE-on-a-Potato project, demonstrated that physical memory is no longer the hard limit for local Mixture-of-Experts inference, with the model using just 8.34GB of system RAM by streaming 176GB of weights from an NVMe SSD. The benchmark established that SSD read bandwidth, not RAM capacity, is now the primary bottleneck for running massive AI models on budget hardware.

Published May 27, 2026 · 4 min read

"Saying it's impossible is not engineering. Saying we don't know how yet is science."

MoE-on-a-Potato is an experimental project dedicated to testing the extreme limits of running massive Mixture-of-Experts (MoE) Large Language Models on consumer-grade, budget hardware.

We successfully ran GLM-5.1 (a 754B parameter model, 176GB GGUF size) on a Ryzen 5 5600G (6 Cores / 12 Threads) CPU, Vega 7 iGPU, and 16GB DDR4 RAM without crashing, establishing a scientific proof of concept for low-memory MoE disk-streaming inference.

For a fully interactive performance breakdown, memory scales, and expert cache projections, open the built-in HTML dashboard: 🔗 Interactive Phase 3 Dashboard

Our journey progressed through three scaling phases, measuring token generation speed, load times, and system peak RAM usage.

Metric	Phase 1: DeepSeek (16B MoE)	Phase 2: GLM-4.7-Flash (30B MoE)	Phase 3: GLM-5.1 (754B MoE)
Active Parameters	2.4B	3B	40B
Model GGUF Size	10.4 GB	18.5 GB	176.0 GB
File Location	Main SSD	Main SSD	Secondary NVMe Partition
Model Load Time	7.8 seconds	29.0 seconds	492.1 seconds (~8.2 min)
Context Warmup Time	Negligible	Negligible	242.0 seconds (~4.0 min)
Prompt Processing Speed	34.56 t/s	2.71 t/s	0.16 t/s (6.45 s/token)
Token Generation Speed	25.15 t/s	6.59 t/s	0.05 t/s (20.89 s/token)
Peak System RAM Usage	4.28 GB	4.41 GB	8.34 GB (Limit: 16 GB!)
Actual Model RAM Footprint	~2.50 GB	~3.00 GB	~6.80 GB

Phase 1 goal: Build a custom llama.cpp compilation optimized for AVX2 and the Vulkan backend to offload processing to the integrated Radeon Vega 7 iGPU.

Outcome:
- By configuring memory mapping (--mmap), we kept the RAM footprint to 4.28 GB (a 47% memory saving over full-RAM loading).
- iGPU offload (-ngl 10) increased token generation speed to 25.15 t/s (a 33.1% speedup compared to CPU-only).
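The effect of memory mapping can be illustrated with a minimal Python sketch. This is not llama.cpp's loader: the dummy file, its size, and the offsets are hypothetical stand-ins for a GGUF file. It maps a whole file into virtual memory but reads only one slice, so only the pages backing that slice need to be resident in RAM.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE
FILE_SIZE = 64 * PAGE  # hypothetical stand-in for a multi-GB weights file


def read_slice(path: str, offset: int, length: int) -> bytes:
    """Map the whole file, but copy out only the requested byte range."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages backing this slice are faulted in from disk.
            return mm[offset:offset + length]


# Build a small dummy file where every page is filled with its own index.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for page_index in range(FILE_SIZE // PAGE):
        tmp.write(bytes([page_index % 256]) * PAGE)
    weights_path = tmp.name

chunk = read_slice(weights_path, offset=10 * PAGE, length=16)
print(len(chunk), chunk[0])  # 16 10
os.remove(weights_path)
```

The same principle lets the operating system evict cold pages under memory pressure and re-read them from disk later, which is why the mapped file can be far larger than physical RAM.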

Phase 2 goal: Run a 30B model (18.5 GB) that physically exceeds the available free system RAM (~12-13 GB after OS/APU overhead).

Outcome:
- The full-RAM run was skipped due to a guaranteed OOM crash. Under mmap, the model booted successfully with only 6.14 GB peak RAM.
- iGPU offload (-ngl 10) pushed token generation to 6.59 t/s while lowering the CPU-side RAM usage to 3.0 GB (a net 51% RAM saving).

Phase 3 goal: Run a colossal 754-billion parameter model (176 GB of split GGUF files) on our 16GB RAM potato PC.

Storage Setup: The weights were hosted on /media/osman/CC46433D46432792, an NTFS-formatted partition of our PCIe Gen3 NVMe SSD. Under Linux, mounting NTFS partitions via FUSE/ntfs-3g adds driver overhead, throttling sequential reads to ~650 MB/s and adding CPU wait times.

Outcome:
- 0 Crashes / 0 OOM Errors: The model initialized and mapped 176 GB of virtual memory with a maximum system RAM footprint of only 8.34 GB.
- Inference Speeds: Generated tokens at 0.05 t/s (20.89 seconds per token), with prompt processing at 0.16 t/s.

Our Phase 3 benchmark provided empirical proof that physical memory capacity is no longer the hard limit for local MoE execution; instead, the bottleneck is SSD read bandwidth.

During token generation, 8 routed experts (~40B active parameters) must be read from disk per token. In IQ1_M (1.6 bits/weight average), this equates to approximately 10 GB of weights per forward pass.

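At the ~650 MB/s the NTFS mount sustains, streaming 10 GB takes about 15.4 seconds, against a measured 20.89 seconds per token. The remaining ~5.5-second delta represents the CPU compute overhead (AVX2 layer execution). This tight correlation indicates that local MoE performance scales roughly in proportion to SSD throughput.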
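The 5.5-second delta can be reproduced from figures quoted in this article: roughly 10 GB read per token, about 650 MB/s of sequential read throughput, and 20.89 seconds measured per token. The sketch below is a back-of-the-envelope check, assuming decimal units (1 GB = 1000 MB) and treating everything that is not disk I/O as compute; the function and constant names are illustrative.

```python
def io_seconds_per_token(read_gb: float, bandwidth_mb_s: float) -> float:
    """Time to stream read_gb (decimal GB) at bandwidth_mb_s (MB/s)."""
    return read_gb * 1000.0 / bandwidth_mb_s


READ_GB_PER_TOKEN = 10.0      # approximate weights read per token (from the text)
NTFS_READ_MB_S = 650.0        # approximate throttled sequential read (from the text)
MEASURED_S_PER_TOKEN = 20.89  # measured Phase 3 generation latency (from the text)

io_s = io_seconds_per_token(READ_GB_PER_TOKEN, NTFS_READ_MB_S)
compute_s = MEASURED_S_PER_TOKEN - io_s      # assumption: the remainder is compute
io_share = io_s / MEASURED_S_PER_TOKEN

print(f"I/O: {io_s:.1f} s, remainder: {compute_s:.1f} s, I/O share: {io_share:.0%}")
# I/O: 15.4 s, remainder: 5.5 s, I/O share: 74%
```

Under these assumptions, roughly three quarters of each token's latency is spent waiting on the disk, so raising the read bandwidth would shrink the dominant term.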

To address the SSD I/O bottleneck, we proposed an Expert Cache/Pinning Layer architecture for llama.cpp and GGML.

Mixture-of-Experts activations follow a highly skewed Zipfian power-law distribution:
- Pinning just 12 out of 64 routed experts (only ~18% of the model parameters) in fast memory (GPU VRAM or locked physical RAM via mlock) achieves a 73% cache hit rate.
- This drops the average disk read per token from 10 GB to just 2.7 GB, projecting speeds of 0.24 t/s (~4.2 seconds/token).
- Pinning 24 experts achieves an 85% hit rate, projecting speeds of 0.43 t/s (~2.3 seconds/token), making a 754B model genuinely usable for slow offline batch agentic tasks.
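A cache hit rate of this kind could be estimated from a routing trace by asking what share of activations the k most frequently used experts account for. The sketch below is a hypothetical illustration, not the project's expert_profiler.py; the function name and the toy trace are made up for the example.

```python
from collections import Counter
from typing import Iterable


def topk_hit_rate(expert_ids: Iterable[int], k: int) -> float:
    """Fraction of activations served by the k most frequently used experts."""
    counts = Counter(expert_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    pinned = counts.most_common(k)  # the k hottest experts and their counts
    return sum(n for _, n in pinned) / total


# Hypothetical toy trace: expert 0 is hot, experts 1-3 are progressively colder.
trace = [0] * 50 + [1] * 25 + [2] * 15 + [3] * 10

print(topk_hit_rate(trace, k=1))  # 0.5
print(topk_hit_rate(trace, k=2))  # 0.75
```

The more skewed the activation counts, the fewer experts need to be pinned to reach a given hit rate. A real measurement would also have to track experts per layer rather than a single global list.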

                  ┌──────────────────────────────────────────┐
                  │            Expert Cache (mlock)          │
                  │        (12 Hot Experts - 73% Hit)        │
                  └────────────────────┬─────────────────────┘
                                       │
                    ┌──────────────────┴──────────────────┐
                    ▼                                     ▼
      [Cache Hit (73% of tokens)]            [Cache Miss (27% of tokens)]
         Read from RAM/VRAM                      Page-in from NVMe SSD
            Latency: ~0.1s                          Latency: ~15.3s
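The hit and miss latencies in the diagram can be combined into an expected per-token latency as a simple weighted average. This is a rough sketch under stated assumptions: it uses the diagram's ~0.1 s and ~15.3 s figures, treats each token as either a full hit or a full miss, and ignores CPU compute time. The function name is illustrative.

```python
def expected_seconds_per_token(hit_rate: float,
                               hit_latency_s: float,
                               miss_latency_s: float) -> float:
    """Weighted average of hit and miss latencies for a given cache hit rate."""
    return hit_rate * hit_latency_s + (1.0 - hit_rate) * miss_latency_s


HIT_LATENCY_S = 0.1    # read from RAM/VRAM (from the diagram)
MISS_LATENCY_S = 15.3  # page-in from NVMe SSD (from the diagram)

# Hit rates quoted in the article for 12 and 24 pinned experts.
for pinned, hit_rate in [(12, 0.73), (24, 0.85)]:
    seconds = expected_seconds_per_token(hit_rate, HIT_LATENCY_S, MISS_LATENCY_S)
    print(f"{pinned} pinned experts: {seconds:.2f} s/token, {1.0 / seconds:.2f} t/s")
```

This simple mix lands close to the projections quoted above. Small differences come from rounding and from whether the hit latency is counted.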

Detailed implementation proposals, code structures, and GGML modifications can be found in the expert_cache_design.md file.

- monitor.py: Real-time logging of CPU, system RAM, and VRAM utilization.
- expert_profiler.py: Analyzes expert routing distributions under local execution.
- expert_cache_design.md: Architecture design and GGML code changes for MoE hot-expert pinning.
- experiments/:
  - faz1_deepseek_v2_lite/: Logs and benchmark summaries for the 16B model.
  - faz2_glm_4.7_flash/: Performance metrics and meta-conversation logs for the 30B model.
  - faz3_glm_5.1_iq1_m/: Metric reports, dashboard logs, and the HTML dashboard for the 754B model.

“Don't tell us it won't work. We'll build it, run it, and benchmark it.” 🚀

source & further reading

Read original on github.com → github.com/snrj35-dev/754B-on-a-Potato
