{"slug": "running-a-35b-moe-model-on-a-2017-amd-rx-580-8gb-via-vulkan-no-rocm-cuda", "title": "Running a 35B MoE model on a 2017 AMD RX 580 8GB via Vulkan (no ROCm/CUDA)", "summary": "A developer successfully ran a 35-billion-parameter Mixture-of-Experts model on a 2017 AMD RX 580 8GB GPU using Vulkan, bypassing CUDA and ROCm. The project, called Polaris Revival, achieved 17-18 tokens per second for LLM inference with Mistral 7B, about 5.6-6.6 tokens per second with the 35B model, and 72 seconds per image for SD 1.5, showing that older AMD hardware can still run modern AI workloads locally and privately.", "body_md": "\n\n```\n██████╗ ██╗  ██╗    ███████╗ █████╗  ██████╗\n██╔══██╗╚██╗██╔╝    ██╔════╝██╔══██╗██╔═████╗\n██████╔╝ ╚███╔╝     ███████╗╚█████╔╝██║██╔██║\n██╔══██╗ ██╔██╗     ╚════██║██╔══██╗████╔╝██║\n██║  ██║██╔╝ ██╗    ███████║╚█████╔╝╚██████╔╝\n╚═╝  ╚═╝╚═╝  ╚═╝    ╚══════╝ ╚════╝  ╚═════╝\nAIVisionsLab · Polaris Revival Project · 2026\n```\n\nGPU from 2017. SOTA AI in 2026. No CUDA. No ROCm. No cloud. No excuses.\n\n```\n\"Your RX 580 can't run AI. Buy a new GPU.\"\n```\n\nAMD dropped ROCm for Polaris/GCN4 in v5.x. DirectML crashes with `OpaqueTensorImpl`. OpenVINO fails silently on Forge. The mainstream AI stack gave up on this card.\n\nWe didn't.\n\nBy compiling `llama.cpp` and `stable-diffusion.cpp` from source with Vulkan support, the RX 580 runs real, useful AI inference in 2026: locally, offline, privately. 
This repository is the complete technical record of how.\n\n```\nRX 580 8GB  ──►  Vulkan API  ──►  ggml engine  ──►  17 tok/s LLM  +  72s/image SD\nXeon 2014   ──►  WSL2 CPU    ──►  ComfyUI       ──►  FLUX 16GB  +  AnimateDiff\n```\n\n- [Hardware](#hardware)\n- [Benchmarks](#benchmarks-real-logs)\n- [Architecture: Dual-Path Stack](#architecture-dual-path-stack)\n- [Critical: Two GGUF Formats for FLUX](#critical-two-gguf-formats-for-flux)\n- [What Failed (and Why)](#what-failed-and-why)\n- [Quick Start: LLM via Vulkan](#quick-start-llm-via-vulkan-windows)\n- [Quick Start: Image Generation via Vulkan](#quick-start-image-generation-via-vulkan)\n- [FLUX Hybrid Setup](#flux-hybrid-setup-gpu--cpu)\n- [OpenWebUI + Docker Integration](#openwebui--docker-integration)\n- [whisper.cpp: Audio Transcription](#whispercpp-audio-transcription-on-rx-580)\n- [Applio RVC: Voice Cloning](#applio-rvc-voice-cloning-on-amd-windows)\n- [AnimateDiff: Video Generation](#animatediff-video-generation)\n- [Linux Native: Ubuntu 26.04 LTS](#linux-native-ubuntu-2604-lts)\n- [Windows vs Linux Comparison](#windows-vs-linux-benchmarks)\n- [Troubleshooting](#troubleshooting)\n- [Automation Scripts](#automation-scripts)\n- [Community Timeline](#community-timeline)\n- [Pushing the 35B Limit: Qwen3.5 MoE Hybrid Experiment](#pushing-the-35b-limit-qwen35-moe-hybrid-experiment)\n- [Repository Structure](#repository-structure)\n\n| Component | Spec |\n|---|---|\n| GPU | AMD RX 580 2048SP 8GB GDDR5 (Polaris / GCN4) |\n| CPU | Intel Xeon E5-2690 v3, 12c/24t · 3.5GHz (2014) |\n| RAM | 32GB DDR4 REG ECC Quad Channel |\n| Storage | NVMe 1TB, 1.7–3.5 GB/s read |\n| OS | Windows 10 Pro + WSL2 Ubuntu 22.04.5 / Ubuntu 26.04 LTS |\n| AMD Driver | 31.0.21924.61 (Amdnolk, Nov 2025) |\n| Vulkan SDK | 1.4.341.1 |\n| CMake | 4.3.2 |\n\nRX 580 2048SP note: The mining variant with 2048 shader processors (vs the original 2304SP) performs identically through Vulkan. Both are Polaris/GCN4.\n\nNVMe impact: Upgrading from HDD to NVMe reduced FLUX.1 model load time from 25 minutes to ~30 seconds. 
Storage is as critical as compute.\n\n| Workload | Model | Backend | Result |\n|---|---|---|---|\n| LLM inference | Mistral 7B Q4_K_M | RX 580 Vulkan | 17–18 tok/s |\n| LLM inference | Qwen3 4B Q4_K_M | RX 580 Vulkan (Linux) | ~35 tok/s |\n| LLM baseline | Mistral 7B Q4_K_M | Xeon CPU pure | 3–5 tok/s |\n| Image gen | DreamShaper 8 (SD 1.5) | RX 580 Vulkan | ~72s / 512×512 |\n| Image gen | flux1-schnell-q4_k | GPU+CPU hybrid | ~14 min @ 1024×1024 |\n| Image gen | FLUX.1 fp8 (16GB) | Xeon WSL2 CPU | ~24 min |\n| Audio transcription | Whisper large-v3-turbo | RX 580 Vulkan (Windows) | 307s for 15min audio |\n| Audio transcription | Whisper large-v3-turbo | RX 580 Vulkan (Linux) | 23.58s for 106s audio |\n| Video / AnimateDiff | SD 1.5 pipeline | Xeon WSL2 CPU | ~141s/frame |\n| Voice clone inference | Applio RVC | Xeon CPU (2h audio) | ~30 min processing |\n\nWhisper on Linux (Mesa RADV) is absurdly faster than Windows — ~150× speedup over pure CPU. VRAM usage: only 1.6GB of 8GB available.\n\nThe core insight of this project: not every workload fits in 8GB of VRAM. The solution is routing intelligently between GPU and CPU rather than forcing everything through one path.\n\n```\nOpenWebUI  :3000  (Docker)\n    │\n    ├──► llama-server  :8081  ──►  RX 580 Vulkan  [llama.cpp]\n    │         └── Ollama      :11434  ──►  CPU fallback\n    │\n    └──► sd-server     :7860  ──►  RX 580 Vulkan  [stable-diffusion.cpp]\n              ├── SD 1.5 GGUF      ──►  72s / image   ✅\n              └── FLUX hybrid      ──►  ~14 min / image  ✅\n    \n    └──► ComfyUI       :8188  ──►  Xeon CPU WSL2   [heavy models > 8GB VRAM]\n```\n\n**Path 1 — GPU Vulkan (RX 580):** All LLM inference + SD 1.5 image generation. Fast, responsive, daily driver.\n\n**Path 2 — CPU Xeon (WSL2):** FLUX.1 16GB models, AnimateDiff video pipelines. Slow but stable. 
The 32GB ECC RAM acts as \"virtual VRAM.\"\n\n**This trips up almost everyone.**\n\n| Source | Compatible with |\n|---|---|\n| `city96` (HuggingFace) | ComfyUI + ComfyUI-GGUF node only |\n| `leejet` (HuggingFace) | `stable-diffusion.cpp` / `sd-server` ✅ |\n\nUsing a `city96` GGUF in `sd-server` returns:\n\n```\n[ERROR] main.cpp:92 - new_sd_ctx_t failed\n```\n\nAlways download FLUX weights from: [huggingface.co/leejet/FLUX.1-schnell-gguf](https://huggingface.co/leejet/FLUX.1-schnell-gguf)\n\nWe documented every dead end. These aren't opinions, they're error logs.\n\n| Attempt | Error | Root Cause |\n|---|---|---|\n| DirectML + ComfyUI | `NotImplementedError: Cannot access storage of OpaqueTensorImpl` | DirectML wraps tensors in opaque objects that ComfyUI's attention backends can't read. Also: abandoned by Microsoft, last update Sep 2024. |\n| ROCm on Polaris | Kernel panics under load | AMD officially dropped GCN4/Polaris in ROCm v5.x. No Windows support either. |\n| OpenVINO + Forge | `ModuleNotFoundError: No module named 'ldm'` | Extension targets old A1111 architecture. Forge restructured `ldm`/`sgm` modules completely. |\n| CPU-only + HDD | ~19 min/image, 85s startup | No GPU acceleration + mechanical I/O bottleneck. The HDD was the hidden killer. |\n| torch-directml + Applio | Version conflict | `torch-directml` requires `torch==2.4.1`. Applio requires `torch==2.7.1`. Irreconcilable. 
|\n\nFull autopsy with logs: `docs/what-failed.md`\n\nRun these commands in Developer PowerShell for Visual Studio.\n\n```\n# Clone and compile with Vulkan backend\ncd E:\\\ngit clone https://github.com/ggerganov/llama.cpp\ncd llama.cpp\ncmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release\ncmake --build build --config Release -j20\n\n# Validate GPU detection\ncd build\\bin\\Release\n.\\llama-cli.exe --list-devices\n# Expected: Vulkan0: AMD Radeon RX 580 2048SP ✅\n\n# Start LLM server\n.\\llama-server.exe -m \"E:\\models\\Mistral-7B-Q4_K_M.gguf\" `\n  --host 0.0.0.0 --port 8081 --device Vulkan0\n```\n\n**Verify it's using the GPU** (not CPU):\n\n```\nlog output during inference:\nggml_vulkan: Found 1 Vulkan device(s)\nggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB\n17.77 t/s  ← RX 580 Vulkan ✅\n```\n\nIf you see 3–5 t/s with no `ggml_vulkan` line, it's running on CPU. Check that `--device Vulkan0` is present.\n\n```\n# Clone with submodules (required for ggml dependency)\ngit clone --recursive https://github.com/leejet/stable-diffusion.cpp\ncd stable-diffusion.cpp\n# Use ';' rather than '&&' so this also works in Windows PowerShell 5.1\nmkdir build; cd build\ncmake .. -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release\ncmake --build . --config Release -j20\n\n# Successful build log:\n# -- Found Vulkan: C:/VulkanSDK/1.4.341.1/Lib/vulkan-1.lib\n# [100%] Built target sd-server ✅\n\n# Start SD server (SD 1.5)\nE:\ncd \"E:\\stable-diffusion.cpp\\build\\bin\\Release\"\n.\\sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 `\n  -m \"E:\\models\\dreamshaper8.gguf\"\n\n# Server output confirms GPU:\n# ggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB\n# Server listening on http://0.0.0.0:7860 ✅\n```\n\nFlag compatibility note: Older builds use `--host` / `--port`. Newer builds (master-600+) use `--listen-ip` / `--listen-port`. Run `sd-server.exe --help` to check which your build expects.\n\nFLUX.1 Schnell requires ~16GB total. 
The strategy: put the diffusion model on VRAM, offload T5XXL and VAE to RAM.\n\n| Component | File | Allocation | Size |\n|---|---|---|---|\n| Diffusion Model | `flux1-schnell-q4_k.gguf` | GPU (VRAM) | ~6.5 GB |\n| VAE | `ae.safetensors` | CPU (RAM) | ~160 MB |\n| CLIP L | `clip_l.safetensors` | GPU (VRAM) | ~235 MB |\n| T5XXL | `t5xxl_fp16.safetensors` | CPU (RAM) | ~9.3 GB |\n\n```\nsd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 ^\n  --diffusion-model \"E:\\models\\flux1-schnell-q4_k.gguf\" ^\n  --vae \"E:\\models\\ae.safetensors\" ^\n  --clip_l \"E:\\models\\clip_l.safetensors\" ^\n  --t5xxl \"E:\\models\\t5xxl_fp16.safetensors\" ^\n  --cfg-scale 1.0 --steps 4 --clip-on-cpu --vae-on-cpu --vae-tiling\n```\n\n`--vae-tiling` is not optional: without it, VAE decode causes OOM and crashes the server. To save RAM, replace `t5xxl_fp16` (~9.3GB) with `t5xxl_fp8` (~5GB).\n\n**Timing per image (1024×1024):**\n\n| Stage | Time |\n|---|---|\n| T5XXL conditioning | 11.49s |\n| Sampling (4 steps) | ~838s |\n| VAE decode (9 tiles) | 40.45s |\n| Total | ~14 min |\n\nFull memory architecture: `docs/flux-setup.md`\n\n```\ndocker run -d \\\n  -p 3000:8080 \\\n  --add-host=host.docker.internal:host-gateway \\\n  -v open-webui:/app/backend/data \\\n  --name open-webui \\\n  --restart always \\\n  ghcr.io/open-webui/open-webui:main\n```\n\n**Connect LLM server:**\n\n- Go to `http://localhost:3000` → Admin Panel → Settings → Connections\n- Under OpenAI API, add:\n  - URL: `http://host.docker.internal:8081/v1`\n  - API Key: `sk-local`\n- Green badge = connected ✅\n\n**Connect image server:**\n\n- Settings → Images → Engine: Automatic1111\n- URL: `http://192.168.x.x:7860/` (use your local IP, not 127.0.0.1, with trailing slash)\n\nNever use `127.0.0.1` for Docker connections: inside the container that address points at the container itself, so it cannot reach services on the host. 
Use `host.docker.internal` for services, or your machine's LAN IP.\n\n**Windows Firewall fix** (Docker subnet blocked by default):\n\n```\n# Run as Administrator\nNew-NetFirewallRule -DisplayName \"sd-server AIVisionsLab\" `\n  -Direction Inbound -Protocol TCP -LocalPort 7860 -Action Allow\n```\n\nFull networking guide: `docs/firewall-fix.md`\n\nVulkan-accelerated audio transcription. The `large-v3-turbo` model uses only 2.6GB of VRAM, leaving plenty of headroom.\n\n**Compile (Developer PowerShell):**\n\n```\n# Activate MSVC environment first (required each session)\n& \"C:\\Program Files (x86)\\Microsoft Visual Studio\\...\\vcvars64.bat\"\n\ncd C:\\\ngit clone https://github.com/ggml-org/whisper.cpp\ncd whisper.cpp\ncmake -B build -DGGML_VULKAN=ON -DGGML_HIPBLAS=OFF -DGGML_HIP=OFF -DGGML_CUDA=OFF\ncmake --build build --config Release -j4\n```\n\n**Download model:**\n\n```\nInvoke-WebRequest `\n  -Uri \"https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin\" `\n  -OutFile \"models\\ggml-large-v3-turbo.bin\"\n```\n\n**Transcribe (MP4 → TXT):**\n\n```\n# Step 1: Extract audio (Whisper requires WAV on Windows)\nffmpeg -i \"video.mp4\" -ar 16000 -ac 1 -c:a pcm_s16le \"audio.wav\"\n\n# Step 2: Transcribe\n.\\build\\bin\\Release\\whisper-cli.exe `\n  -m models\\ggml-large-v3-turbo.bin `\n  -f \"audio.wav\" -l pt --output-txt\n\n# With translation to English:\n.\\build\\bin\\Release\\whisper-cli.exe `\n  -m models\\ggml-large-v3-turbo.bin `\n  -f \"audio.wav\" -l pt --translate --output-txt\n```\n\n**Performance (15-min video, Windows):**\n\n| Stage | Time |\n|---|---|\n| Model load | 4s |\n| Mel spectrogram | 1.2s |\n| GPU encode | 73s |\n| Decode + batch | 168s |\n| Total | 307s |\n\nVRAM used: 2.6GB of 8GB. CPU stays at ~5%.\n\n⚠️ WSL2 does not expose the RX 580 to Vulkan, so always use native Windows PowerShell for GPU transcription.\n\n⚠️ `--translate` only outputs English. 
For other target languages, add a translation step after.\n\nFull pipeline: `Text → Balabolka (TTS) → WAV → Applio RVC (voice conversion) → final audio`\n\n**Why this pipeline instead of pure TTS:**\n\n| Aspect | Pure XTTS | Antônio Neural → Yuri RVC |\n|---|---|---|\n| Prosody | Artificial | Human (real actor) |\n| Long texts | Degrades | Stable |\n| Vocal identity | Generic | Cloned |\n| Naturalness | 60–70% | 80–95% |\n\n**Key findings for AMD Windows (2026):**\n\nDirectML acceleration is effectively dead: `torch-directml` requires `torch==2.4.1` while Applio requires `torch==2.7.1`. The version conflict is irreconcilable. **Use CPU mode: it works, it just takes time.**\n\nTraining speed on Xeon E5-2690 v3: ~6 min/epoch. 200 epochs = ~20 hours.\n\n**Critical gotchas:**\n\n```\n# NEVER set these — they silently break feature extraction:\n# set CUDA_VISIBLE_DEVICES=-1\n# set ROCM_VISIBLE_DEVICES=-1\n# They leave logs/project/extracted/ empty, training \"succeeds\" but produces nothing.\n\n# Always verify after extraction:\ndir logs\\my-project\\extracted\\   # Must contain .npy files\n```\n\n**Create required mute files** (missing from git install):\n\n``` python\npython -c \"\nimport numpy as np, soundfile as sf, os\n[os.makedirs(d, exist_ok=True) for d in [\n    'logs/mute/sliced_audios','logs/mute/extracted',\n    'logs/mute/f0','logs/mute/f0_voiced'\n]]\nsf.write('logs/mute/sliced_audios/mute40000.wav', np.zeros(int(40000*3.7)), 40000)\nsf.write('logs/mute/sliced_audios/mute48000.wav', np.zeros(int(48000*3.7)), 48000)\nnp.save('logs/mute/extracted/mute.npy', np.zeros((196, 768)))  # shape (196,768) critical\nnp.save('logs/mute/f0/mute.wav.npy', np.zeros(100))\nnp.save('logs/mute/f0_voiced/mute.wav.npy', np.zeros(100))\nprint('OK')\n\"\n```\n\nFull guide: `docs/applio-rvc.md`\n\nAnimateDiff injects temporal attention modules into SD 1.5, converting still-image diffusion into coherent video loops. 
Runs on Xeon CPU via ComfyUI in WSL2.\n\n```\n# WSL2 Ubuntu\nconda activate comfy_env\npython main.py --cpu --listen 0.0.0.0 --port 8188\n```\n\nAccess from Windows: `http://localhost:8188`\n\nPerformance: ~141 seconds per frame on Xeon E5-2690 v3 (24 threads).\n\nBare-metal Linux (no WSL2, no Docker GPU passthrough) with Mesa RADV open-source drivers.\n\n**System:** Ubuntu 26.04 LTS (Resolute Raccoon), Kernel 7.0, Mesa RADV 26.0.3, Vulkan 1.4.341\n\n**Validate GPU:**\n\n```\nlspci | grep -i vga\n# AMD Radeon RX 580 2048SP detected\n\nvulkaninfo --summary 2>/dev/null | grep -A5 \"Devices\"\n# GPU0: DRIVER_ID_MESA_RADV | driverInfo = Mesa 26.0.3 ✅\n```\n\n**LLM server:**\n\n```\n~/llama.cpp/build/bin/llama-server \\\n  -m \"/run/media/user/NVMe/models/Qwen3-4B-Q4_K_M.gguf\" \\\n  --host 0.0.0.0 --port 8081 \\\n  -ngl 99 -t 24\n```\n\n**FLUX image server:**\n\n```\n~/stable-diffusion.cpp/build/bin/sd-server \\\n  --listen-ip 0.0.0.0 --listen-port 7860 \\\n  --diffusion-model /path/to/flux1-schnell-q4_k.gguf \\\n  --vae /path/to/ae.safetensors \\\n  --clip_l /path/to/clip_l.safetensors \\\n  --t5xxl /path/to/t5xxl_fp16.safetensors \\\n  --cfg-scale 1.0 --steps 4 --clip-on-cpu --vae-on-cpu --vae-tiling\n```\n\n`--vae-tiling` is mandatory on Linux too: without it, VAE decode crashes the GNOME display server. Avoid `--backend vulkan0` for heavy models on Linux, since it causes context-loss bugs.\n\n**Whisper transcription:**\n\n```\n~/whisper.cpp/build/bin/whisper-cli \\\n  -m ~/whisper.cpp/models/ggml-large-v3-turbo.bin \\\n  -f \"audio.wav\" -l pt --output-txt\n```\n\n**Docker services running in parallel:**\n\n| Container | Image | Port | Purpose |\n|---|---|---|---|\n| open-webui | `ghcr.io/open-webui/open-webui:main` | 3000 | Chat UI |\n| portainer | `portainer/portainer-ce` | 9000 | Docker management |\n| searxng | `searxng/searxng:latest` | 8080 | Private search for RAG |\n\n⚠️ ROCm is not usable on Polaris/GCN4. AMD dropped support. 
Running Ollama with GPU via Docker on RX 580 will fail. Use `llama-server` compiled with Vulkan instead, and keep Docker for frontends only.\n\n| Workload | Windows 10 | Ubuntu 26.04 (Mesa RADV) | Winner |\n|---|---|---|---|\n| LLM Qwen3 4B @ 99 layers | ~15–17 tok/s | ~35 tok/s | 🏆 Linux (2×) |\n| LLM Qwen3.6 35B @ max layers | 7.62 tok/s (max 10 layers) | 5.18 tok/s (max 20 layers) | ⚖️ Technical tie |\n| SD 1.5 DreamShaper (50 steps) | ~72s | ~85s | 🏆 Windows |\n| FLUX Schnell (4 steps, 512×512) | ~84s | ~52s sampling (~95s total) | 🏆 Linux |\n| Whisper large-v3-turbo | 307s for 15 min audio · 2.6GB VRAM | 23.58s for 106s audio · 1.6GB VRAM | 🏆 Linux |\n\n**Why Linux is faster for LLM:** Mesa RADV allows up to 20 GPU layers for the 35B model where Windows AMD drivers cap at 10. For smaller models, RADV's memory management is simply more efficient.\n\n**Why Windows wins SD 1.5:** The proprietary AMD driver has more stable direct rendering for this specific workload.\n\n**Whisper gap explained:** Mesa RADV's Vulkan compute path for whisper.cpp is more optimized than the Windows AMD driver equivalent. The raw totals (307s vs 23.58s) suggest a 13× gap, but the Windows run transcribed 15 minutes of audio and the Linux run 106 seconds. Per second of audio that is about 0.34s on Windows versus 0.22s on Linux, roughly a 1.5× speedup on the same GPU and model.\n\n**`generate_image` returned no results / frozen terminal**\nCause: `sd-server` integer overflow bug with random seeds (`Seed: -1`).\nFix: Set a fixed integer seed in OpenWebUI advanced options (e.g., `42`, `1337`).\n\n**Model trained successfully instantly (Applio)**\nThis is a silent failure. Training completes in seconds and produces nothing.\nCause: `CUDA_VISIBLE_DEVICES=-1` or similar environment variables were set, breaking feature extraction.\nFix: Open a clean PowerShell with no prior `set` commands. Verify `logs/project/extracted/` contains `.npy` files after extraction before starting training.\n\n**FLUX OOM / DeviceMemoryAllocation crash**\nFix: Ensure the `--vae-tiling` flag is present. Confirm T5XXL is on CPU (`--clip-on-cpu --vae-on-cpu`). 
Consider switching to `t5xxl_fp8` to save ~4.3GB RAM.\n\n**`new_sd_ctx_t failed` with FLUX GGUF**\nYou're using a `city96` GGUF. These only work in ComfyUI with the ComfyUI-GGUF node.\nFix: Download from [leejet's repo](https://huggingface.co/leejet/FLUX.1-schnell-gguf) instead.\n\n**Docker can't reach sd-server or llama-server**\nCause: Windows Defender blocks Docker's `172.x.x.x` subnet by default.\nFix: See [OpenWebUI + Docker Integration](#openwebui--docker-integration) and add the firewall rule.\n\n**Compilation errors in WSL2 for Vulkan builds**\nWSL2 does not expose the RX 580 for Vulkan compute. Compile and run GPU workloads from native Windows PowerShell only. Use WSL2 exclusively for CPU workloads (ComfyUI, Applio, Ollama CPU fallback).\n\n**`--override-tensor exps=CPU` slows down inference on Vulkan**\nThis flag is optimized for CUDA/PCIe on Nvidia. Under Vulkan, the CPU↔GPU memory transfer overhead destroys any MoE offloading gains. Do not apply CUDA-optimized flags to Vulkan backends.\n\nSave as `iniciar_ia_server.bat` on the Desktop:\n\n```\n@echo off\ntitle Servidor IA Local - Producao\ncls\n\n:: Kill ghost processes holding VRAM/ports\ntaskkill /f /im sd-server.exe 2>nul\ntaskkill /f /im llama-server.exe 2>nul\ntimeout /t 2 /nobreak >nul\n\n:: Start LLM server (Vulkan)\nstart \"LLM Server - Vulkan RX580\" C:\\llama.cpp\\build\\bin\\Release\\llama-server.exe ^\n  -m \"E:\\models\\Mistral-7B-Q4_K_M.gguf\" ^\n  --host 0.0.0.0 --port 8081 --device Vulkan0\n\ntimeout /t 3 /nobreak >nul\n\n:: Start SD server (Vulkan)\nE:\ncd \"E:\\stable-diffusion.cpp\\build\\bin\\Release\"\nsd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 ^\n  -m \"E:\\models\\dreamshaper8.gguf\"\n\npause\n```\n\n**Critical rules:**\n\n- `taskkill` before start: releases VRAM from stuck background processes\n- `--host 0.0.0.0`: required for Docker to reach the server\n- `--device Vulkan0`: without this, inference falls back to CPU (3–5 tok/s)\n- Never use `.\\` before 
executables in CMD, since it breaks the shell\n- Switch to the drive (`E:`) before `cd`, because CMD doesn't change drives automatically\n\nInstant validation scripts: run them before building anything.\n\n```\n# Linux / WSL2\n./vulkan-diagnostic.sh\n:: Windows CMD\nvulkan-diagnostic.bat\n```\n\nExpected output:\n\n```\nggml_vulkan: Found 1 Vulkan device(s)\nggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB  ✅\n```\n\nIf your card doesn't appear, it is a driver or Vulkan SDK issue. See [Master Documentation](https://setup-ia-local-rx580-vulkan.web.app).\n\nThree independent researchers. Same GPU. Same conclusion: the hardware was never the problem.\n\n| Date | Author | Contribution |\n|---|---|---|\n| Jan 2025 | 艾米心 Amihart | First documented LLM via Vulkan on RX 580: 24.56 tok/s on Debian. Declared SD via Vulkan \"not viable\" (limitation of `sd.cpp` at that time). |\n| Dec 2025 | DH / DadHacks | Refuted Amihart's SD conclusion. Used `stable-diffusion.cpp` with `-DSD_VULKAN=ON`, ran FLUX Schnell GGUF generation on RX 580 from terminal. |\n| 2026 | AIVisionsLab | Full Windows production stack: Vulkan LLM + SD + FLUX hybrid + OpenWebUI + Docker networking + Applio RVC + AnimateDiff + whisper.cpp + Linux native benchmarks. |\n\n| Capability | Amihart | DadHacks | AIVisionsLab |\n|---|---|---|---|\n| LLM Vulkan | ✅ 24.56 tok/s | ✅ | ✅ 15–35 tok/s |\n| SD via Vulkan | ❌ | ✅ CLI | ✅ Server + API |\n| FLUX GGUF | ❌ | ✅ CLI | ✅ Hybrid GPU/CPU |\n| GUI / OpenWebUI | Docker only | ❌ | ✅ Full integration |\n| Windows native | ❌ | ❌ | ✅ |\n| Automation scripts | ❌ | ❌ | ✅ .bat double-click |\n| Voice cloning | ❌ | ❌ | ✅ Applio RVC |\n| Video / AnimateDiff | ❌ | ❌ | ✅ |\n| Audio transcription | ❌ | ❌ | ✅ whisper.cpp |\n| Linux native (Ubuntu 26.04) | Debian | Debian | ✅ Ubuntu 26.04 LTS |\n| GGUF format mapping | ❌ | ❌ | ✅ city96 vs leejet |\n\nThe shared technical foundation: `ggml` and `llama.cpp` by Georgi Gerganov, plus `stable-diffusion.cpp` by leejet. 
Vulkan compute backends in pure C++ that bypass the entire ROCm/CUDA ecosystem.\n\nCredit: 艾米心 (Amihart), DH (DadHacks), leejet, ggerganov, woodrex, and all independent developers working on hardware preservation and open inference.\n\nTwo lab sessions pushed the dual-path stack to its extreme: running a **34.66B-parameter MoE model** (Qwen3.5-35B) on the same RX 580 8GB, using llama.cpp's automatic GPU/RAM fitting across **4 memory tiers** (VRAM → DDR4 ECC → NVMe → HDD swap).\n\n**Quick links: six focused docs, one question each:**\n\n- [Benchmarks](/aivisionslab-studios/rx580-local-ai-guide/blob/main/docs/qwen35-35b/benchmarks.md)\n- [Thinking mode context overflow](/aivisionslab-studios/rx580-local-ai-guide/blob/main/docs/qwen35-35b/thinking-mode-context-overflow.md)\n- [OpenWebUI timeout vs server truncation](/aivisionslab-studios/rx580-local-ai-guide/blob/main/docs/qwen35-35b/openwebui-timeout-vs-server-truncation.md)\n- [ctx-size and quantization tuning](/aivisionslab-studios/rx580-local-ai-guide/blob/main/docs/qwen35-35b/ctx-size-and-quantization-tuning.md)\n- [Model reasoning about its own architecture](/aivisionslab-studios/rx580-local-ai-guide/blob/main/docs/qwen35-35b/model-reasoning-about-its-own-architecture.md): what do the `<think>` traces look like?\n\nFull narrative lab reports (raw logs, timelines, complete test history):\n\n- 📄 [docs/qwen35-35b-hybrid-experiment.md](/aivisionslab-studios/rx580-local-ai-guide/blob/main/docs/qwen35-35b-hybrid-experiment.md) (Session 1)\n- 📄 [docs/qwen35-35b-proving-hypothesis.md](/aivisionslab-studios/rx580-local-ai-guide/blob/main/docs/qwen35-35b-proving-hypothesis.md) (Session 2)\n\n**Key takeaway:** the RX 580 never crashed or throttled across either session (peak 80°C, limit ~90°C). Every failure traced back to software-side timeouts and context-buffer limits, not hardware capacity. 
With `--ctx-size 8192` and Q4_K_M quantization, a 35B MoE model runs stable, full responses included, on a 2017 GPU, with whatever does not fit in VRAM held in system RAM.\n\n```\nGPU from 2017 + CPU from 2014  ──►  34.66B parameters  ──►  5.6–6.6 tok/s\n```\n\n```\nrx580-local-ai-guide/\n│\n├── scripts/\n│   ├── start-ai.bat              # Full stack, all services\n│   ├── iniciar_sd_server.bat     # SD 1.5 only\n│   ├── iniciar_flux_server.bat   # FLUX hybrid GPU/CPU\n│   ├── reboot_stack.bat          # Kill all + restart\n│   ├── vulkan-diagnostic.bat     # Vulkan validation (Windows)\n│   ├── vulkan-diagnostic.sh      # Vulkan validation (Linux/WSL2)\n│   ├── build-llamacpp.sh         # Compile llama.cpp (Linux/WSL2)\n│   └── build-sdcpp.sh            # Compile stable-diffusion.cpp\n│\n├── docs/\n│   ├── benchmarks.md             # Real hardware logs, full tables\n│   ├── what-failed.md            # DirectML, ROCm, OpenVINO autopsy\n│   ├── flux-setup.md             # FLUX hybrid memory architecture\n│   ├── firewall-fix.md           # Docker + Windows Firewall\n│   ├── wsl2-setup.md             # ComfyUI CPU on WSL2\n│   ├── applio-rvc.md             # Voice cloning full guide\n│   ├── whisper-cpp.md            # Audio transcription guide\n│   ├── linux-ubuntu2604.md       # Ubuntu 26.04 bare-metal guide\n│   ├── qwen35-35b-hybrid-experiment.md     # 35B MoE hybrid limit test, full log (Session 1)\n│   ├── qwen35-35b-proving-hypothesis.md    # 35B MoE ctx-size/curl proof, full log (Session 2)\n│   └── qwen35-35b/                         # Atomic SEO-focused docs, one question per page\n│       ├── README.md\n│       ├── running-35b-on-8gb-vram.md\n│       ├── benchmarks.md\n│       ├── thinking-mode-context-overflow.md\n│       ├── openwebui-timeout-vs-server-truncation.md\n│       ├── ctx-size-and-quantization-tuning.md\n│       └── model-reasoning-about-its-own-architecture.md\n│\n├── vulkan-diagnostic.bat         # Quick validation (root, Windows)\n├── vulkan-diagnostic.sh          # Quick validation (root, 
Linux)\n└── README.md\n```\n\n- [llama.cpp](https://github.com/ggerganov/llama.cpp): The engine behind LLM inference\n- [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp): Image generation in C++\n- [whisper.cpp](https://github.com/ggml-org/whisper.cpp): Audio transcription in C++\n- [OpenWebUI](https://github.com/open-webui/open-webui): The chat/image interface\n- [AIVisionsLab Portal](https://setup-ia-local-rx580-vulkan.web.app): Full documentation (PT/EN)\n\nThis guide is built from real hardware testing, real error logs, and real failures. If you:\n\n- Got this working on a related card (RX 570, RX 590, RX 5500 XT)\n- Found better build flags or quantization settings for Vulkan\n- Have benchmarks from a different CPU/RAM configuration\n- Fixed a bug in the scripts\n\nOpen a PR or issue. Everything here is MIT licensed: use it, fork it, share it.\n\nMIT: do whatever you want with this. Just don't tell people their old GPU is useless.\n\n*Built in São Paulo, Brazil 🇧🇷 · Hardware from 2014–2017 · Running SOTA AI in 2026*\n\n\"The problem was never the GPU.\"", "url": "https://wpnews.pro/news/running-a-35b-moe-model-on-a-2017-amd-rx-580-8gb-via-vulkan-no-rocm-cuda", "canonical_source": "https://github.com/aivisionslab-studios/rx580-local-ai-guide", "published_at": "2026-06-20 22:16:06+00:00", "updated_at": "2026-06-20 22:37:31.260227+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-tools", "machine-learning", "large-language-models", "generative-ai"], "entities": ["AMD", "RX 580", "Vulkan", "llama.cpp", "stable-diffusion.cpp", "Mistral 7B", "FLUX.1", "Whisper"], "alternates": {"html": "https://wpnews.pro/news/running-a-35b-moe-model-on-a-2017-amd-rx-580-8gb-via-vulkan-no-rocm-cuda", "markdown": "https://wpnews.pro/news/running-a-35b-moe-model-on-a-2017-amd-rx-580-8gb-via-vulkan-no-rocm-cuda.md", "text": "https://wpnews.pro/news/running-a-35b-moe-model-on-a-2017-amd-rx-580-8gb-via-vulkan-no-rocm-cuda.txt", "jsonld": 
"https://wpnews.pro/news/running-a-35b-moe-model-on-a-2017-amd-rx-580-8gb-via-vulkan-no-rocm-cuda.jsonld"}}