Local LLMs on my TrueNAS, and the frontend I had to build

wpnews.pro


Source: loomcycle.dev · Published: 2026-06-28T12:00Z · Topic: large-language-models


A developer upgraded a TrueNAS NAS with an AMD Ryzen 7 8700G APU and 96 GB DDR5 for local LLM inference, achieving 13-15 tok/s on gemma4, after rejecting pricier options like NVIDIA DGX Spark and Mac Studio. The build cost ~EUR 2,100, half the entry price of alternatives, and required workarounds for ROCm support and thermal management. The developer also built a custom frontend after finding Open WebUI inadequate for their workflow.


Field log from upgrading a lab NAS (Intel N100, 16 GB DDR5) into one box that hosts product-test VMs (JobEmber.ai plus a sibling SaaS in stealth pre-release), the loomcycle multi-replica server, and local LLM inference.

The constraint framed every other decision: no spare $4,500-5,500 for an NVIDIA DGX Spark or for a Mac Studio with serious unified memory in the same band, and Strix Halo (Ryzen AI MAX) starts around EUR 4,000 / $5,000 in Europe with everything soldered. That constraint ruled out the Spark on price, Strix Halo on price AND rigidity, and a discrete-GPU build on cost-per-model-GB and thermal-envelope grounds. Total build cost: ~EUR 2,100 (Ryzen 7 8700G + 96 GB DDR5 + motherboard + new PSU), roughly half the entry price of the rejected options.

The answer was to upgrade the existing NAS: AM5 socket (swappable chip), DIMM DDR5 (upgradeable capacity and timing), an APU as the inference engine, and a clean upgrade path to the next-generation Ryzen APU. Final build: AMD Ryzen 7 8700G with 96 GB of DDR5, doubling as the existing TrueNAS NAS.

An APU is not the same as a desktop CPU with integrated graphics. The 8700G's Radeon 780M (12 CUs) is the entry point; the 2-CU iGPUs on regular Ryzen and Intel chips are useless for inference, and there is no 12-core or 16-core APU with a strong iGPU on AM5.

Memory bandwidth, not core count, is the bottleneck. DDR5-6000 CL30 EXPO is the AM5 sweet spot (the Phoenix memory controller tops out around 6000-6400 MT/s with two DIMMs), so a DDR5-8000 kit downclocks. The kit suffix encodes the profile (Corsair Z = EXPO, C = XMP).

Migration: fresh-install plus config restore (don't clone the boot pool). ZFS data pools are portable via zpool import; moving to bigger disks uses ZFS replication; anything configured outside the GUI doesn't transfer.

gfx1103 (the 780M's GPU target) is not officially supported by ROCm. Force HSA_OVERRIDE_GFX_VERSION=11.0.2 and OLLAMA_IGPU_ENABLE=1; if rocBLAS errors on TensileLibrary.dat, install prebuilt gfx1103 Tensile kernels.
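The two ROCm overrides above only help if they reach the Ollama server process. A minimal sketch of one way to apply them, assuming the server is started directly with `ollama serve` and that `ollama` is on PATH; on TrueNAS the same two variables would more likely go into the app or container environment. The helper names `ollama_env` and `start_ollama` are made up for this example.

```python
import os
import subprocess

# Overrides quoted in this log for running ROCm on the 780M (gfx1103).
ROCM_IGPU_OVERRIDES = {
    "HSA_OVERRIDE_GFX_VERSION": "11.0.2",
    "OLLAMA_IGPU_ENABLE": "1",
}


def ollama_env(base=None):
    """Return a copy of `base` (default: os.environ) with the overrides applied.

    The input mapping is never modified, so the calling process keeps its
    own environment untouched.
    """
    env = dict(os.environ if base is None else base)
    env.update(ROCM_IGPU_OVERRIDES)
    return env


def start_ollama():
    """Start `ollama serve` with the overrides (assumes `ollama` is on PATH)."""
    return subprocess.Popen(["ollama", "serve"], env=ollama_env())
```

Building the environment as a copy keeps the override scoped to the inference server, which matters on a box that also runs VMs and other services.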
Real-workload throughput on this box: gemma4:latest at 13-15 tok/s; qwen3.6:latest at 9-12 tok/s; a smaller 3-4 GB model in the 24-48 tok/s band. The cross-model gap is the memory-bandwidth thesis playing out: more weight bytes read per token means proportionally lower throughput, so the gap is not a compute limit.

GTT memory lets the iGPU address tens of gigabytes regardless of the BIOS UMA cap; a 24 GB model runs at 100% GPU on an integrated graphics core with a 128K context window. OLLAMA_FLASH_ATTENTION=1 + OLLAMA_KV_CACHE_TYPE=q8_0 + num_gpu=99 push more layers onto the iGPU: the first two shrink the KV cache, and the last requests full layer offload. vLLM is for datacenter GPUs and doesn't support the 780M.

Thermal surprise: the iGPU shares a physical package with the CPU cores and there is one temperature sensor, so "100% GPU" inference heats the package and shows up as "CPU temperature." A PPT cap at 65 W drops a 90 °C load to under 60 °C with no measurable speed loss, since inference is memory-bound.

The frontend was the next problem: I tried Open WebUI for two days and uninstalled it. The chat surface itself is good (clean thread, conversation list, in-thread renderer, keyboard shortcuts). The blockers are underneath. The configuration UI is confusing: two days in I still wasn't sure which of several places held the "default model for new chats" setting. Providers and models have two unlinked configuration surfaces, and one of them does nothing (the first one I edited was vestigial; the OTHER was the one that mattered). And Open WebUI can't reach the loomcycle tools and primitives I'd built workflows around (Documents, Channels, Interruption + mid-run steering on interactive sessions, per-principal MCP dispatch). So I'm building the chat I wanted on top of the substrate I already use, following the chat-first sequencing in RFC AC.
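The throughput figures and the memory-bandwidth argument earlier in this log can be sanity-checked with a few lines of arithmetic. The sketch below is a back-of-envelope estimate, not a model of the actual hardware. It assumes dual-channel DDR5 at 8 bytes per transfer per channel, and it assumes generation speed is bounded by how many weight bytes are read per token. For sparse (mixture-of-experts) models the number to plug in is the bytes actually read, not the file size. The measured rate uses the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama reports in its generate and chat responses. The function names and the sample numbers are illustrative, not taken from the original post.

```python
def peak_bandwidth_gbs(mt_per_s, channels=2, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (decimal units).

    mt_per_s is the memory speed in megatransfers per second,
    e.g. 6000 for DDR5-6000.
    """
    return mt_per_s * bytes_per_transfer * channels / 1000


def bandwidth_bound_tok_s(bandwidth_gbs, gb_read_per_token):
    """Rough tokens/s estimate if every token costs one pass over the weights read."""
    return bandwidth_gbs / gb_read_per_token


def measured_tok_s(response):
    """Decode rate from an Ollama response dict.

    eval_count is the number of generated tokens; eval_duration is the
    time spent generating them, in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


if __name__ == "__main__":
    bw = peak_bandwidth_gbs(6000)            # DDR5-6000, two channels
    print(bw)                                # 96.0
    # Hypothetical model that reads 8 GB of weights per token:
    print(bandwidth_bound_tok_s(bw, 8.0))    # 12.0
    # Made-up response values, for illustration only:
    sample = {"eval_count": 140, "eval_duration": 10_000_000_000}
    print(measured_tok_s(sample))            # 14.0
```

Real memory never sustains its theoretical peak, and caches and sparse models shift the result in the other direction, so treat the estimate as an order-of-magnitude check against measured tok/s rather than a hard ceiling.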
The chat surface ships first: a standalone React + Vite SPA in a new loomboard repo, built on the published @loomcycle/client, with UX modelled on what Open WebUI gets right. Each conversation is one loomcycle interactive session (RFC AI); the full tool loop renders inline; there are live token/throughput/context metrics plus a context-compaction button; Interruption answers in place; per-conversation model overrides go through a derived AgentDef that doesn't mutate the shared one; and it reuses the existing wire only (no new transports).

The board lands next in the same app: kanban over Document + Path, AgentTeam graphs from RFC AP for state transitions, and the launch publishing plan as the first dogfood loop.

In parallel, the two loomcycle pieces I'm head-down on right now are tenant authorization (a real multi-tenant trust boundary across the wire surfaces) and loomcycle running as a TrueNAS-dockerized application; each deserves its own writeup as a next blog topic. The point is that once the hardware worked, the frontend was the bottleneck.


Read original on loomcycle.dev → loomcycle.dev/blog/local-llms-on-truenas-and-the…
