Field log from upgrading a lab NAS (Intel N100, 16 GB DDR5) into one box that hosts product-test VMs (JobEmber.ai plus a sibling SaaS in stealth pre-release), the loomcycle multi-replica server, and local LLM inference.
One constraint framed every other decision: no spare $4,500-5,500. That is the band for an NVIDIA DGX Spark or a Mac Studio with serious unified memory; Strix Halo (Ryzen AI MAX) starts around EUR 4,000 / $5,000 in Europe, and everything on it is soldered. The budget ruled out the Spark on price, Strix Halo on price AND rigidity, and a discrete-GPU build on cost-per-model-GB and thermal-envelope grounds.
The answer was to upgrade the existing NAS: an AM5 socket (swappable chip), DIMM DDR5 (upgradeable capacity and timings), an APU as the inference engine, and a clean upgrade path to the next-generation Ryzen APU. Final build: AMD Ryzen 7 8700G with 96 GB of DDR5, doubling as the existing TrueNAS NAS. Total cost: ~EUR 2,100 (Ryzen 7 8700G + 96 GB DDR5 + motherboard + new PSU), roughly half the entry price of the rejected options.
An APU is not the same as a desktop CPU with integrated graphics. The 8700G's Radeon 780M (12 CUs) is the entry point; the 2-CU iGPUs on regular Ryzen and Intel chips are useless for inference, and there is no 12-core or 16-core APU with a strong iGPU on AM5. Memory bandwidth, not core count, is the bottleneck. DDR5-6000 CL30 EXPO is the AM5 sweet spot (the Phoenix memory controller tops out around 6000-6400 MT/s with two DIMMs), so a DDR5-8000 kit just downclocks. The kit suffix encodes the profile (Corsair Z = EXPO, C = XMP).
Migration: fresh install plus config restore (don't clone the boot pool). ZFS data pools are portable via zpool import; moving to bigger disks goes through ZFS replication; anything configured outside the GUI doesn't transfer.
ROCm does not officially support gfx1103. Force HSA_OVERRIDE_GFX_VERSION=11.0.2 plus OLLAMA_IGPU_ENABLE=1; if rocBLAS errors on TensileLibrary.dat, install prebuilt gfx1103 Tensile kernels.
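A minimal sketch of the gfx1103 override, assuming the Ollama server process is launched with an environment you control. The two variable names and values come from this log; the helper name `ollama_env_for_gfx1103` is hypothetical, and on a TrueNAS app you would more likely set the same pairs in the container's environment settings than in Python.

```python
import os
from typing import Mapping, Optional

# Values copied from the log; the helper below is illustrative, not part of Ollama.
GFX1103_OVERRIDES = {
    "HSA_OVERRIDE_GFX_VERSION": "11.0.2",  # report the 780M as a ROCm-supported target
    "OLLAMA_IGPU_ENABLE": "1",             # allow Ollama to use the integrated GPU
}


def ollama_env_for_gfx1103(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of `base` (default: current environment) with the overrides applied."""
    env = dict(os.environ if base is None else base)
    env.update(GFX1103_OVERRIDES)
    return env


# Show the pairs that would be handed to the server process.
env = ollama_env_for_gfx1103({"PATH": "/usr/bin"})
for name in sorted(GFX1103_OVERRIDES):
    print(f"{name}={env[name]}")
```

To start the server from Python with this environment, pass it along as `subprocess.run(["ollama", "serve"], env=ollama_env_for_gfx1103())`. The helper copies its input rather than mutating it, so the calling process keeps its own environment unchanged.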
Real-workload throughput on this box: gemma4:latest at 13-15 tok/s, qwen3.6:latest at 9-12 tok/s, and a smaller 3-4 GB model in the 24-48 tok/s band. The cross-model gap is the memory-bandwidth thesis playing out: more weight bytes read per token means proportionally lower throughput; the gap is not compute-limited. GTT memory lets the iGPU address tens of gigabytes regardless of the BIOS UMA cap, so a 24 GB model runs at 100% GPU on an integrated graphics core with a 128K context window. Together, OLLAMA_FLASH_ATTENTION=1, OLLAMA_KV_CACHE_TYPE=q8_0, and num_gpu=99 shrink the KV cache and push more layers onto the iGPU. vLLM targets datacenter GPUs and doesn't support the 780M.
Thermal surprise: the iGPU shares one physical package with the CPU cores and there is a single temperature sensor, so "100% GPU" inference heats the package and shows up as "CPU temperature." A PPT cap at 65 W drops a 90 °C load to under 60 °C with no measurable speed loss, since inference is memory-bound.
The frontend was the next problem. I tried Open WebUI for two days and uninstalled it. The chat surface itself is good (clean thread, conversation list, in-thread renderer, keyboard shortcuts). The blockers are underneath. The configuration UI is confusing: two days in, I still wasn't sure which of several places held the "default model for new chats" setting. Providers and models have two unlinked configuration surfaces, and one of them does nothing (the first one I edited was vestigial; the OTHER was the one that mattered). And Open WebUI can't reach the loomcycle tools and primitives I'd built workflows around (Documents, Channels, Interruption + mid-run steering on interactive sessions, per-principal MCP dispatch). So I'm building the chat I wanted on top of the substrate I already use, following the chat-first sequencing in RFC AC.
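Back to the throughput numbers for a moment: a back-of-envelope sketch of the memory-bandwidth ceiling. It assumes dual-channel DDR5-6000 (two 64-bit channels) and a dense model whose weights are each read once per generated token. The function names and the model sizes are illustrative, not measurements from this box. The estimate ignores KV-cache reads, and models that touch only part of their weights per token (mixture-of-experts, for example) behave differently.

```python
def peak_bandwidth_gbs(mt_per_s: int, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    return mt_per_s * 1e6 * channels * bus_bytes / 1e9


def decode_ceiling_tok_s(bandwidth_gbs: float, active_weight_gb: float) -> float:
    """Upper bound on tokens/s if every active weight byte is read once per token."""
    return bandwidth_gbs / active_weight_gb


bandwidth = peak_bandwidth_gbs(6000)  # DDR5-6000, dual channel
print(f"peak bandwidth: {bandwidth:.1f} GB/s")

# Hypothetical weight footprints in GB, chosen only to show the scaling.
for weights_gb in (4.0, 8.0, 24.0):
    ceiling = decode_ceiling_tok_s(bandwidth, weights_gb)
    print(f"{weights_gb:5.1f} GB of weights -> at most {ceiling:.1f} tok/s")
```

The ceiling is inversely proportional to the bytes read per token, which is why doubling the weight footprint roughly halves throughput regardless of how many CUs or cores are available.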
The chat surface ships first: a standalone React + Vite SPA in a new loomboard repo, built on the published @loomcycle/client. The UX is modelled on what Open WebUI gets right. Each conversation is one loomcycle interactive session (RFC AI); the full tool loop renders inline; live token/throughput/context metrics sit next to a context-compaction button; Interruption answers happen in place; per-conversation model overrides go through a derived AgentDef that doesn't mutate the shared one; and the app reuses the existing wire only (no new transports).
The board lands next in the same app: kanban over Document + Path, AgentTeam graphs from RFC AP for state transitions, and the launch publishing plan as the first dogfood loop.
In parallel, the two loomcycle pieces I'm head-down on right now are tenant authorization (a real multi-tenant trust boundary across the wire surfaces) and loomcycle running as a TrueNAS-dockerized application; both deserve their own writeup as the next blog topic. The point: once the hardware worked, the frontend was the bottleneck.
Local LLMs on a Ryzen 8700G iGPU: 13-15 tok/s on gemma4, 9-12 on qwen3.6