Source: loomcycle.dev · Topic: large-language-models

Local LLMs on my TrueNAS, and the frontend I had to build

A developer upgraded a TrueNAS NAS with an AMD Ryzen 7 8700G APU and 96 GB DDR5 for local LLM inference, achieving 13-15 tok/s on gemma4, after rejecting pricier options like NVIDIA DGX Spark and Mac Studio. The build cost ~EUR 2,100, half the entry price of alternatives, and required workarounds for ROCm support and thermal management. The developer also built a custom frontend after finding Open WebUI inadequate for their workflow.

Published Jun 28, 2026 · 4 min read

This is a field log from upgrading a lab NAS (Intel N100, 16 GB DDR5) into one box that hosts product-test VMs (JobEmber.ai plus a sibling SaaS in stealth pre-release), the loomcycle multi-replica server, and local LLM inference. One constraint framed every other decision: there was no spare $4,500-5,500 for an NVIDIA DGX Spark; a Mac Studio with serious unified memory sits in the same price band; and Strix Halo (Ryzen AI MAX) starts around EUR 4,000 / $5,000 in Europe, with everything soldered. That framing ruled out the Spark on price, Strix Halo on price and rigidity, and a discrete-GPU build on cost-per-model-GB and thermal-envelope grounds.
The answer was to upgrade the existing NAS: an AM5 socket (swappable chip), DIMM-based DDR5 (upgradeable capacity and timings), an APU as the inference engine, and a clean upgrade path to the next-generation Ryzen APU. With that locked, the final build is an AMD Ryzen 7 8700G with 96 GB of DDR5, doubling as the existing TrueNAS NAS. Total build cost: ~EUR 2,100 (Ryzen 7 8700G + 96 GB DDR5 + motherboard + new PSU), roughly half the entry price of the rejected options.
An APU is not the same as a desktop CPU with integrated graphics. The 8700G's Radeon 780M (12 CUs) is the entry point; the 2-CU iGPUs on regular Ryzen and Intel chips are useless for inference, and there is no 12-core or 16-core APU with a strong iGPU on AM5. Memory bandwidth, not core count, is the bottleneck. DDR5-6000 CL30 EXPO is the AM5 sweet spot: the Phoenix memory controller tops out around 6000-6400 MT/s with two DIMMs, so a DDR5-8000 kit downclocks. The kit suffix encodes the profile (Corsair Z = EXPO, C = XMP).
Migration: do a fresh install plus config restore rather than cloning the boot pool. ZFS data pools are portable via zpool import, moving to bigger disks goes through ZFS replication, and anything configured outside the GUI does not transfer.
ROCm: gfx1103 is not officially supported, so force HSA_OVERRIDE_GFX_VERSION=11.0.2 and OLLAMA_IGPU_ENABLE=1. If rocBLAS errors on TensileLibrary.dat, install prebuilt gfx1103 Tensile kernels.
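
The memory-bandwidth argument can be put into rough numbers. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes a dual-channel setup with a 64-bit (8-byte) bus per channel, and that generating one token of a dense model streams every weight byte once. The function names and the example model sizes are illustrative and do not come from the build log.

```python
def peak_bandwidth_gb_s(mt_per_s: int, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    # mega-transfers/s * bytes per transfer * channels gives MB/s; /1000 -> GB/s
    return mt_per_s * bus_bytes * channels / 1000


def dense_tok_s_estimate(bandwidth_gb_s: float, weights_gb: float) -> float:
    """First-order tokens/s if each generated token reads all weights once."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    bw = peak_bandwidth_gb_s(6000)  # DDR5-6000, two channels (assumed)
    for size_gb in (4.0, 8.0, 24.0):  # hypothetical quantized model sizes
        est = dense_tok_s_estimate(bw, size_gb)
        print(f"{size_gb:5.1f} GB of weights: ~{est:.1f} tok/s at {bw:.0f} GB/s")
```

Achieved bandwidth sits below the theoretical peak, while mixture-of-experts models read fewer bytes per token than their full weight file, so the output is an order-of-magnitude guide rather than a prediction for any specific model.
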
Real-workload throughput on this box: gemma4:latest at 13-15 tok/s, qwen3.6:latest at 9-12 tok/s, and a smaller 3-4 GB model in the 24-48 tok/s band. The cross-model gap is the memory-bandwidth thesis playing out: more weight bytes read per token means proportionally lower throughput, so the gap is not compute-limited. GTT memory lets the iGPU address tens of gigabytes regardless of the BIOS UMA cap, so a 24 GB model runs at 100% GPU on an integrated graphics core with a 128K context window. OLLAMA_FLASH_ATTENTION=1, OLLAMA_KV_CACHE_TYPE=q8_0, and num_gpu=99 push more layers onto the iGPU. vLLM targets datacenter GPUs and doesn't support the 780M.
Thermal surprise: the iGPU shares one physical package with the CPU cores and there is a single temperature sensor, so "100% GPU" inference heats the package and shows up as "CPU temperature." A PPT cap of 65 W drops a 90 °C load to under 60 °C with no measurable speed loss, since inference is memory-bound.
The frontend was the next problem. I tried Open WebUI for two days and uninstalled it. The chat surface itself is good (clean thread, conversation list, in-thread renderer, keyboard shortcuts). The blockers are underneath. The configuration UI is confusing: two days in, I still wasn't sure which of several places held the "default model for new chats" setting. Providers and models have two unlinked configuration surfaces, and one of them does nothing (the first one I edited was vestigial; the other was the one that mattered). And Open WebUI can't reach the loomcycle tools and primitives I'd built workflows around (Documents, Channels, Interruption plus mid-run steering on interactive sessions, per-principal MCP dispatch). So I'm building the chat I wanted on top of the substrate I already use, following the chat-first sequencing in RFC AC.
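
For reproducing tok/s figures like the ones quoted for this box, one option is to read the timing fields Ollama returns with each non-streaming response. The sketch below assumes an Ollama server on its default port (localhost:11434) and uses the /api/generate response fields eval_count and eval_duration (nanoseconds). The helper names and the prompt are illustrative, and num_ctx=131072 is an assumed value standing in for the 128K window; this is not the author's benchmark script.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default endpoint


def build_request(model: str, prompt: str) -> dict:
    """Request body with per-request options mirroring the settings in the text."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object that includes timing fields
        "options": {"num_gpu": 99, "num_ctx": 131072},
    }


def tokens_per_second(response: dict) -> float:
    """Decode throughput: generated tokens divided by generation time (ns -> s)."""
    return response["eval_count"] * 1e9 / response["eval_duration"]


def measure(model: str, prompt: str) -> float:
    """Send one prompt to the server and return the measured tok/s."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return tokens_per_second(json.load(resp))


if __name__ == "__main__":
    rate = measure("gemma4:latest", "Explain ZFS replication briefly.")
    print(f"{rate:.1f} tok/s")
```

Note that OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE are environment variables read by the Ollama server process, so they belong in the server's environment rather than in the request body; only num_gpu and num_ctx travel as per-request options here.
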
The chat surface ships first: a standalone React + Vite SPA in a new loomboard repo, built on the published @loomcycle/client, with UX modelled on what Open WebUI gets right. Each conversation is one loomcycle interactive session (RFC AI). The full tool loop renders inline, alongside live token/throughput/context metrics and a context-compaction button. Interruption answers happen in place. Per-conversation model overrides go through a derived AgentDef that doesn't mutate the shared one. The app reuses the existing wire only (no new transports).
The board lands next in the same app: kanban over Document + Path, AgentTeam graphs from RFC AP for state transitions, and the launch publishing plan as the first dogfood loop. In parallel, the two loomcycle pieces I'm head-down on right now are tenant authorization (a real multi-tenant trust boundary across the wire surfaces) and loomcycle running as a TrueNAS-dockerized application; both deserve their own writeup as the next blog topic. The point is that once the hardware worked, the frontend became the bottleneck.
