
Running a 35B MoE model on a 2017 AMD RX 580 8GB via Vulkan (no ROCm/CUDA)

A developer successfully ran a 35-billion-parameter Mixture-of-Experts model on a 2017 AMD RX 580 8GB GPU using Vulkan, bypassing CUDA and ROCm. The project, called Polaris Revival, achieved 17-18 tokens per second for LLM inference and 72 seconds per image for SD 1.5, proving that older AMD hardware can still run modern AI workloads locally and privately.

Published Jun 20, 2026 · 17 min read
AIVisionsLab · Polaris Revival Project · 2026

GPU from 2017. SOTA AI in 2026. No CUDA. No ROCm. No cloud. No excuses.

"Your RX 580 can't run AI. Buy a new GPU."

AMD dropped ROCm for Polaris/GCN4 in v5.x. DirectML crashes with OpaqueTensorImpl. OpenVINO fails silently on Forge. The mainstream AI stack gave up on this card.

We didn't.

By compiling llama.cpp and stable-diffusion.cpp from source with Vulkan support, the RX 580 runs real, useful AI inference in 2026: locally, offline, privately. This repository is the complete technical record of how.

RX 580 8GB  ──►  Vulkan API  ──►  ggml engine  ──►  17 tok/s LLM  +  72s/image SD
Xeon 2014   ──►  WSL2 CPU    ──►  ComfyUI       ──►  FLUX 16GB  +  AnimateDiff

Contents:

  • Hardware
  • Benchmarks
  • Architecture: Dual-Path Stack
  • Critical: Two GGUF Formats for FLUX
  • What Failed (and Why)
  • Quick Start: LLM via Vulkan
  • Quick Start: Image Generation via Vulkan
  • FLUX Hybrid Setup
  • OpenWebUI + Docker Integration
  • whisper.cpp: Audio Transcription
  • Applio RVC: Voice Cloning
  • AnimateDiff: Video Generation
  • Linux Native: Ubuntu 26.04 LTS
  • Windows vs Linux Comparison
  • Troubleshooting
  • Automation Scripts
  • Community Timeline
  • Pushing the 35B Limit: Qwen3.5 MoE Hybrid Experiment
  • Repository Structure

| Component | Spec |
| --- | --- |
| GPU | AMD RX 580 2048SP 8GB GDDR5 (Polaris / GCN4) |
| CPU | Intel Xeon E5-2690 v3, 12c/24t · 3.5GHz (2014) |
| RAM | 32GB DDR4 REG ECC Quad Channel |
| Storage | NVMe 1TB, 1.7–3.5 GB/s read |
| OS | Windows 10 Pro + WSL2 Ubuntu 22.04.5 / Ubuntu 26.04 LTS |
| AMD Driver | 31.0.21924.61 (Amdnolk, Nov 2025) |
| Vulkan SDK | 1.4.341.1 |
| CMake | 4.3.2 |

RX 580 2048SP note: The mining variant with 2048 shader processors (vs the original 2304SP) performs identically through Vulkan. Both are Polaris/GCN4.

NVMe impact: Upgrading from HDD to NVMe reduced FLUX.1 model load time from 25 minutes to ~30 seconds. Storage is as critical as compute.

| Workload | Model | Backend | Result |
| --- | --- | --- | --- |
| LLM inference | Mistral 7B Q4_K_M | RX 580 Vulkan | 17–18 tok/s |
| LLM inference | Qwen3 4B Q4_K_M | RX 580 Vulkan (Linux) | ~35 tok/s |
| LLM baseline | Mistral 7B Q4_K_M | Xeon CPU pure | 3–5 tok/s |
| Image gen | DreamShaper 8 (SD 1.5) | RX 580 Vulkan | ~72s / 512×512 |
| Image gen | flux1-schnell-q4_k | GPU+CPU hybrid | ~14 min @ 1024×1024 |
| Image gen | FLUX.1 fp8 (16GB) | Xeon WSL2 CPU | ~24 min |
| Audio transcription | Whisper large-v3-turbo | RX 580 Vulkan (Windows) | 307s for 15min audio |
| Audio transcription | Whisper large-v3-turbo | RX 580 Vulkan (Linux) | 23.58s for 106s audio |
| Video / AnimateDiff | SD 1.5 pipeline | Xeon WSL2 CPU | ~141s/frame |
| Voice clone inference | Applio RVC | Xeon CPU (2h audio) | ~30 min processing |

Whisper on Linux (Mesa RADV) is absurdly faster than on Windows, with a ~150× speedup over pure CPU. VRAM usage: only 1.6GB of the 8GB available.
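The two Whisper rows in the benchmark table use clips of different lengths (15 minutes on Windows, 106 seconds on Linux), so the raw seconds are hard to compare directly. A minimal sketch, using only the figures from that table, normalizes both to a real-time factor. The helper name `realtime_factor` is illustrative and not part of the project.

```python
# Real-time factor = seconds of audio transcribed per second of wall-clock time.
# Audio lengths and timings are copied from the benchmark table above.
runs = {
    "Windows (AMD driver)": {"audio_s": 15 * 60, "wall_s": 307.0},
    "Linux (Mesa RADV)": {"audio_s": 106.0, "wall_s": 23.58},
}


def realtime_factor(audio_s: float, wall_s: float) -> float:
    """Return how many seconds of audio are processed per wall-clock second."""
    return audio_s / wall_s


for name, run in runs.items():
    factor = realtime_factor(run["audio_s"], run["wall_s"])
    print(f"{name}: {factor:.2f}x real time")
```

Both runs come out faster than real time. The normalized figure is the safer one to quote when clip lengths differ.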

The core insight of this project: not every workload fits in 8GB of VRAM. The solution is routing intelligently between GPU and CPU rather than forcing everything through one path.

OpenWebUI  :3000  (Docker)
    │
    ├──► llama-server  :8081  ──►  RX 580 Vulkan  [llama.cpp]
    │         └── Ollama      :11434  ──►  CPU fallback
    │
    ├──► sd-server     :7860  ──►  RX 580 Vulkan  [stable-diffusion.cpp]
    │         ├── SD 1.5 GGUF      ──►  72s / image   ✅
    │         └── FLUX hybrid      ──►  ~14 min / image  ✅
    │
    └──► ComfyUI       :8188  ──►  Xeon CPU WSL2   [heavy models > 8GB VRAM]

Path 1, GPU Vulkan (RX 580): All LLM inference + SD 1.5 image generation. Fast, responsive, daily driver.

Path 2, CPU Xeon (WSL2): FLUX.1 16GB models, AnimateDiff video pipelines. Slow but stable. The 32GB ECC RAM acts as "virtual VRAM."
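The routing rule behind the two paths can be sketched as a size check against the 8GB of VRAM. This is an illustration only: the `Workload` type, the `choose_path` helper, the 1GB headroom, and the ~4.4GB figure for a 7B Q4_K_M file are assumptions, not values from the repository.

```python
from dataclasses import dataclass

VRAM_GB = 8.0      # RX 580 capacity, from the hardware table
HEADROOM_GB = 1.0  # hypothetical margin for activations and the desktop


@dataclass
class Workload:
    name: str
    weights_gb: float  # approximate size of the weights that must stay resident


def choose_path(w: Workload) -> str:
    """Return 'gpu-vulkan' if the weights fit in VRAM with headroom, else 'cpu-wsl2'."""
    if w.weights_gb + HEADROOM_GB <= VRAM_GB:
        return "gpu-vulkan"
    return "cpu-wsl2"


# 4.4 GB is an assumed file size for a 7B Q4_K_M model; 16 GB is the FLUX.1 fp8 figure above.
for w in (Workload("Mistral 7B Q4_K_M", 4.4), Workload("FLUX.1 fp8", 16.0)):
    print(w.name, "->", choose_path(w))
```

The FLUX hybrid setup described later is a third option this sketch does not model: it splits one pipeline across both memory pools.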

Two GGUF formats exist for FLUX, and mixing them up trips up almost everyone.

| Source | Compatible with |
| --- | --- |
| city96 (HuggingFace) | ComfyUI + ComfyUI-GGUF node only |
| leejet (HuggingFace) | stable-diffusion.cpp / sd-server ✅ |

Using a city96 GGUF in sd-server returns:

[ERROR] main.cpp:92 - new_sd_ctx_t failed

Always download FLUX weights from: huggingface.co/leejet/FLUX.1-schnell-gguf

We documented every dead end. These aren't opinions; they're error logs.

| Attempt | Error | Root Cause |
| --- | --- | --- |
| DirectML + ComfyUI | NotImplementedError: Cannot access storage of OpaqueTensorImpl | DirectML wraps tensors in opaque objects that ComfyUI's attention backends can't read. Also: abandoned by Microsoft, last update Sep 2024. |
| ROCm on Polaris | Kernel panics under load | AMD officially dropped GCN4/Polaris in ROCm v5.x. No Windows support either. |
| OpenVINO + Forge | ModuleNotFoundError: No module named 'ldm' | Extension targets the old A1111 architecture. Forge restructured the ldm/sgm modules completely. |
| CPU-only + HDD | ~19 min/image, 85s startup | No GPU acceleration + mechanical I/O bottleneck. The HDD was the hidden killer. |
| torch-directml + Applio | Version conflict | torch-directml requires torch==2.4.1. Applio requires torch==2.7.1. Irreconcilable. |

Full autopsy with logs: docs/what-failed.md

Run these commands in Developer PowerShell for Visual Studio.

cd E:\
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j20

cd build\bin\Release
.\llama-cli.exe --list-devices

.\llama-server.exe -m "E:\models\Mistral-7B-Q4_K_M.gguf" `
  --host 0.0.0.0 --port 8081 --device Vulkan0

Verify it's using the GPU (not CPU):

log output during inference:
ggml_vulkan: Found 1 Vulkan device(s)
ggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB
17.77 t/s  ← RX 580 Vulkan ✅

If you see 3–5 t/s with no ggml_vulkan line, it's running on CPU. Check that --device Vulkan0 is present.
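The same check can be scripted against a captured server log. The sketch below assumes the log contains device lines in the format shown above; the regular expression and the `vulkan_devices` helper are illustrative and not part of llama.cpp.

```python
import re

# Matches the per-device line printed when the Vulkan backend is active, e.g.
# "ggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB" as in the log above.
DEVICE_LINE = re.compile(r"^ggml_vulkan:\s*(\d+)\s*=\s*(.+?)\s*(?:\||$)", re.MULTILINE)


def vulkan_devices(log_text: str) -> list[str]:
    """Return the device names reported by ggml_vulkan lines, or [] if none."""
    return [match.group(2) for match in DEVICE_LINE.finditer(log_text)]


sample = """ggml_vulkan: Found 1 Vulkan device(s)
ggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB
"""
print(vulkan_devices(sample) or "no Vulkan device: inference is on CPU")
```

An empty list means no device line was found, which matches the CPU-fallback case described above.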

git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp
mkdir build && cd build
cmake .. -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j20


E:
cd "E:\stable-diffusion.cpp\build\bin\Release"
.\sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 `
  -m "E:\models\dreamshaper8.gguf"

Flag compatibility note: Older builds use --host / --port. Newer builds (master-600+) use --listen-ip / --listen-port. Run sd-server.exe --help to check which your build expects.

FLUX.1 Schnell requires ~16GB total. The strategy: put the diffusion model on VRAM, offload T5XXL and VAE to RAM.

| Component | File | Allocation | Size |
| --- | --- | --- | --- |
| Diffusion Model | flux1-schnell-q4_k.gguf | GPU (VRAM) | ~6.5 GB |
| VAE | ae.safetensors | CPU (RAM) | ~160 MB |
| CLIP L | clip_l.safetensors | GPU (VRAM) | ~235 MB |
| T5XXL | t5xxl_fp16.safetensors | CPU (RAM) | ~9.3 GB |

sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 ^
  --diffusion-model "E:\models\flux1-schnell-q4_k.gguf" ^
  --vae "E:\models\ae.safetensors" ^
  --clip_l "E:\models\clip_l.safetensors" ^
  --t5xxl "E:\models\t5xxl_fp16.safetensors" ^
  --cfg-scale 1.0 --steps 4 --clip-on-cpu --vae-on-cpu --vae-tiling

--vae-tiling is not optional: without it, VAE decode causes OOM and crashes the server. To save RAM, replace t5xxl_fp16 (~9.3GB) with t5xxl_fp8 (~5GB).
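As a rough check on the split, the sizes in the component table can be summed per memory pool. This sketch takes the Allocation column as written and ignores activations and runtime buffers, so treat the result as a lower bound, not a measurement.

```python
# Sizes in GB, copied from the component table above, keyed by component.
components = {
    "diffusion (flux1-schnell-q4_k)": ("vram", 6.5),
    "clip_l": ("vram", 0.235),
    "vae (ae.safetensors)": ("ram", 0.16),
    "t5xxl_fp16": ("ram", 9.3),
}


def totals(parts: dict) -> dict:
    """Sum component sizes per memory pool ('vram' or 'ram')."""
    out = {"vram": 0.0, "ram": 0.0}
    for pool, size_gb in parts.values():
        out[pool] += size_gb
    return out


t = totals(components)
print(f"VRAM {t['vram']:.1f} GB of 8, RAM {t['ram']:.1f} GB, total {sum(t.values()):.1f} GB")
```

The total lands near the ~16GB quoted above, with the VRAM share under the card's 8GB.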

Timing per image (1024×1024):

| Stage | Time |
| --- | --- |
| T5XXL conditioning | 11.49s |
| Sampling (4 steps) | ~838s |
| VAE decode (9 tiles) | 40.45s |
| Total | ~14 min |
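To see where the time goes, the stage timings above can be summed and expressed as shares of the total. This is plain arithmetic on the table's figures; the sampling value is approximate in the source.

```python
# Per-stage timings in seconds, copied from the table above.
stages = {
    "T5XXL conditioning": 11.49,
    "Sampling (4 steps)": 838.0,
    "VAE decode (9 tiles)": 40.45,
}

total_s = sum(stages.values())
for name, seconds in stages.items():
    share = 100 * seconds / total_s
    print(f"{name:<22} {seconds:>7.2f}s  {share:5.1f}%")
print(f"{'Total':<22} {total_s:>7.2f}s  ({total_s / 60:.1f} min)")
```

Sampling accounts for roughly 94% of the run, so step count and resolution are the settings that matter most for total time.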

Full memory architecture: docs/flux-setup.md

docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Connect LLM server:

  • Go to http://localhost:3000 → Admin Panel → Settings → Connections
  • Under OpenAI API, add:
      URL: http://host.docker.internal:8081/v1
      API Key: sk-local
  • Green badge = connected ✅

Connect image server:

  • Settings → Images → Engine: Automatic1111
  • URL: http://192.168.x.x:7860/ (use your local IP, not 127.0.0.1, with trailing slash)

Never use 127.0.0.1 for Docker connections: Docker runs in an isolated network and cannot reach the host's localhost. Use host.docker.internal for services, or your machine's LAN IP.
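The rewrite from a loopback address to the Docker host alias can be expressed as a small helper. The function name `dockerize_url` is hypothetical; it only illustrates the rule above and is not part of OpenWebUI or Docker.

```python
from urllib.parse import urlsplit, urlunsplit

LOOPBACK_HOSTS = {"127.0.0.1", "localhost", "::1"}


def dockerize_url(url: str, host_alias: str = "host.docker.internal") -> str:
    """Rewrite a loopback URL so a container can reach a service on the host."""
    parts = urlsplit(url)
    if parts.hostname not in LOOPBACK_HOSTS:
        return url  # already a LAN IP or hostname, leave untouched
    netloc = host_alias if parts.port is None else f"{host_alias}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(dockerize_url("http://127.0.0.1:8081/v1"))
```

LAN addresses such as the image-server URL pass through unchanged.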

Windows Firewall fix (Docker subnet blocked by default):

New-NetFirewallRule -DisplayName "sd-server AIVisionsLab" `
  -Direction Inbound -Protocol TCP -LocalPort 7860 -Action Allow

Full networking guide: docs/firewall-fix.md

whisper.cpp provides Vulkan-accelerated audio transcription. The large-v3-turbo model uses only 2.6GB of VRAM, leaving plenty of headroom.

Compile (Developer PowerShell):

& "C:\Program Files (x86)\Microsoft Visual Studio\...\vcvars64.bat"

cd C:\
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build -DGGML_VULKAN=ON -DGGML_HIPBLAS=OFF -DGGML_HIP=OFF -DGGML_CUDA=OFF
cmake --build build --config Release -j4

Download model:

Invoke-WebRequest `
  -Uri "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin" `
  -OutFile "models\ggml-large-v3-turbo.bin"

Transcribe (MP4 → TXT):

ffmpeg -i "video.mp4" -ar 16000 -ac 1 -c:a pcm_s16le "audio.wav"

.\build\bin\Release\whisper-cli.exe `
  -m models\ggml-large-v3-turbo.bin `
  -f "audio.wav" -l pt --output-txt

.\build\bin\Release\whisper-cli.exe `
  -m models\ggml-large-v3-turbo.bin `
  -f "audio.wav" -l pt --translate --output-txt

Performance (15-min video, Windows):

| Stage | Time |
| --- | --- |
| Model load | 4s |
| Mel spectrogram | 1.2s |
| GPU encode | 73s |
| Decode + batch | 168s |
| Total | 307s |

VRAM used: 2.6GB of 8GB. CPU stays at ~5%.

⚠️ WSL2 does not expose the RX 580 to Vulkan. Always use native Windows PowerShell for GPU transcription.

⚠️ --translate only outputs English. For other target languages, add a translation step after.
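For batch jobs, the two-step flow (extract 16 kHz mono WAV, then transcribe) can be wrapped in a helper that builds both argument lists. This is a sketch: `transcription_commands` is a hypothetical name, the flags are the ones shown in the commands above, and nothing is executed until the lists are passed to `subprocess.run`.

```python
from pathlib import Path


def transcription_commands(video: str, model: str, language: str = "pt",
                           whisper_cli: str = r".\build\bin\Release\whisper-cli.exe"):
    """Build the ffmpeg and whisper-cli argument lists used in the steps above."""
    wav = str(Path(video).with_suffix(".wav"))
    extract = ["ffmpeg", "-i", video, "-ar", "16000", "-ac", "1",
               "-c:a", "pcm_s16le", wav]
    transcribe = [whisper_cli, "-m", model, "-f", wav, "-l", language, "--output-txt"]
    return extract, transcribe


for cmd in transcription_commands("video.mp4", r"models\ggml-large-v3-turbo.bin"):
    # Pass each list to subprocess.run(cmd, check=True) to actually execute it.
    print(" ".join(cmd))
```

Building the argument lists separately from running them makes it easy to log or dry-run a batch before committing GPU time.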

Full pipeline: Text → Balabolka (TTS) → WAV → Applio RVC (voice conversion) → final audio

Why this pipeline instead of pure TTS:

| Aspect | Pure XTTS | Antônio Neural → Yuri RVC |
| --- | --- | --- |
| Prosody | Artificial | Human (real actor) |
| Long texts | Degrades | Stable |
| Vocal identity | Generic | Cloned |
| Naturalness | 60–70% | 80–95% |

Key findings for AMD Windows (2026):

DirectML acceleration is effectively dead: torch-directml requires torch==2.4.1 while Applio requires torch==2.7.1. The version conflict is irreconcilable. Use CPU mode; it works, it just takes time.

Training speed on Xeon E5-2690 v3: ~6 min/epoch. 200 epochs = ~20 hours.

Critical gotchas:


dir logs\my-project\extracted\   # Must contain .npy files
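The directory check above can also be scripted so that training is not started on an empty feature folder. A minimal sketch, assuming the same `logs/<project>/extracted` layout; the `extraction_ok` helper is hypothetical and not part of Applio.

```python
from pathlib import Path


def extraction_ok(project_dir: str) -> bool:
    """True if <project_dir>/extracted holds at least one .npy feature file."""
    extracted = Path(project_dir) / "extracted"
    return extracted.is_dir() and any(extracted.glob("*.npy"))


# "logs/my-project" mirrors the path in the command above; adjust to your project name.
if not extraction_ok("logs/my-project"):
    print("No .npy files found: feature extraction failed, do not start training.")
```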

Create required mute files (missing from git install):

python -c "
import numpy as np, soundfile as sf, os
[os.makedirs(d, exist_ok=True) for d in [
    'logs/mute/sliced_audios','logs/mute/extracted',
    'logs/mute/f0','logs/mute/f0_voiced'
]]
sf.write('logs/mute/sliced_audios/mute40000.wav', np.zeros(int(40000*3.7)), 40000)
sf.write('logs/mute/sliced_audios/mute48000.wav', np.zeros(int(48000*3.7)), 48000)
np.save('logs/mute/extracted/mute.npy', np.zeros((196, 768)))  # shape (196,768) critical
np.save('logs/mute/f0/mute.wav.npy', np.zeros(100))
np.save('logs/mute/f0_voiced/mute.wav.npy', np.zeros(100))
print('OK')
"

Full guide: docs/applio-rvc.md

AnimateDiff injects temporal attention modules into SD 1.5, converting still-image diffusion into coherent video loops. Runs on Xeon CPU via ComfyUI in WSL2.

conda activate comfy_env
python main.py --cpu --listen 0.0.0.0 --port 8188

Access from Windows: http://localhost:8188

Performance: ~141 seconds per frame on Xeon E5-2690 v3 (24 threads).

Bare-metal Linux (no WSL2, no Docker GPU passthrough) with Mesa RADV open-source drivers.

System: Ubuntu 26.04 LTS (Resolute Raccoon), Kernel 7.0, Mesa RADV 26.0.3, Vulkan 1.4.341

Validate GPU:

lspci | grep -i vga

vulkaninfo --summary 2>/dev/null | grep -A5 "Devices"

LLM server:

~/llama.cpp/build/bin/llama-server \
  -m "/run/media/user/NVMe/models/Qwen3-4B-Q4_K_M.gguf" \
  --host 0.0.0.0 --port 8081 \
  -ngl 99 -t 24

FLUX image server:

~/stable-diffusion.cpp/build/bin/sd-server \
  --listen-ip 0.0.0.0 --listen-port 7860 \
  --diffusion-model /path/to/flux1-schnell-q4_k.gguf \
  --vae /path/to/ae.safetensors \
  --clip_l /path/to/clip_l.safetensors \
  --t5xxl /path/to/t5xxl_fp16.safetensors \
  --cfg-scale 1.0 --steps 4 --clip-on-cpu --vae-on-cpu --vae-tiling

--vae-tiling is mandatory on Linux too: without it, VAE decode crashes the GNOME display server. Avoid --backend vulkan0 for heavy models on Linux, as it causes context-loss bugs.

Whisper transcription:

~/whisper.cpp/build/bin/whisper-cli \
  -m ~/whisper.cpp/models/ggml-large-v3-turbo.bin \
  -f "audio.wav" -l pt --output-txt

Docker services running in parallel:

| Container | Image | Port | Purpose |
| --- | --- | --- | --- |
| open-webui | ghcr.io/open-webui/open-webui:main | 3000 | Chat UI |
| portainer | portainer/portainer-ce | 9000 | Docker management |
| searxng | searxng/searxng:latest | 8080 | Private search for RAG |

⚠️ ROCm is not usable on Polaris/GCN4. AMD dropped support. Running Ollama with GPU via Docker on RX 580 will fail. Use llama-server compiled with Vulkan instead, and keep Docker for frontends only.

| Workload | Windows 10 | Ubuntu 26.04 (Mesa RADV) | Winner |
| --- | --- | --- | --- |
| LLM Qwen3 4B @ 99 layers | ~15–17 tok/s | ~35 tok/s | 🏆 Linux (2×) |
| LLM Qwen3.6 35B @ max layers | 7.62 tok/s (max 10 layers) | 5.18 tok/s (max 20 layers) | ⚖️ Technical tie |
| SD 1.5 DreamShaper (50 steps) | ~72s | ~85s | 🏆 Windows |
| FLUX Schnell (4 steps, 512×512) | ~84s | ~52s sampling (~95s total) | 🏆 Linux |
| Whisper large-v3-turbo (106s audio) | 307s · 2.6GB VRAM | 23.58s · 1.6GB VRAM | 🏆 Linux (absurd) |

Why Linux is faster for LLM: Mesa RADV allows up to 20 GPU layers for the 35B model where Windows AMD drivers cap at 10. For smaller models, RADV's memory management is simply more efficient.

Why Windows wins SD 1.5: The proprietary AMD driver has more stable direct rendering for this specific workload.

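Whisper gap explained: Mesa RADV's Vulkan compute path for whisper.cpp is significantly more optimized than the Windows AMD driver equivalent. The wall-clock gap is 13× (307s vs 23.58s) on the same GPU and model. Note that the benchmark table lists the Windows run against 15 minutes of audio and the Linux run against 106 seconds, so the per-second-of-audio gap is smaller than 13×.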

generate_image returned no results / frozen terminal
Cause: sd-server integer overflow bug with random seeds (Seed: -1). Fix: Set a fixed integer seed in OpenWebUI advanced options (e.g., 42, 1337).

"Model trained successfully" instantly (Applio)
This is a silent failure. Training completes in seconds and produces nothing. Cause: CUDA_VISIBLE_DEVICES=-1 or similar environment variables were set, breaking feature extraction. Fix: Open a clean PowerShell with no prior set commands. Verify that logs/project/extracted/ contains .npy files after extraction before starting training.

FLUX OOM / DeviceMemoryAllocation crash
Fix: Ensure the --vae-tiling flag is present. Confirm T5XXL is on CPU (--clip-on-cpu --vae-on-cpu). Consider switching to t5xxl_fp8 to save ~4.3GB RAM.

new_sd_ctx_t failed with FLUX GGUF
You're using a city96 GGUF. These only work in ComfyUI with the ComfyUI-GGUF node. Fix: Download from leejet's repo instead.

Docker can't reach sd-server or llama-server
Cause: Windows Defender blocks Docker's 172.x.x.x subnet by default. Fix: See OpenWebUI + Docker Integration and add the firewall rule.

Compilation errors in WSL2 for Vulkan builds
WSL2 does not expose the RX 580 for Vulkan compute. Compile and run GPU workloads from native Windows PowerShell only. Use WSL2 exclusively for CPU workloads (ComfyUI, Applio, Ollama CPU fallback).

--override-tensor exps=CPU slows down inference on Vulkan
This flag is optimized for CUDA/PCIe on Nvidia. Under Vulkan, the CPU↔GPU memory transfer overhead destroys any MoE offload gains. Do not apply CUDA-optimized flags to Vulkan backends.

Save as iniciar_ia_server.bat on the Desktop:

@echo off
title Local AI Server - Production
cls

:: Kill ghost processes holding VRAM/ports
taskkill /f /im sd-server.exe 2>nul
taskkill /f /im llama-server.exe 2>nul
timeout /t 2 /nobreak >nul

:: Start LLM server (Vulkan)
start "LLM Server - Vulkan RX580" E:\llama.cpp\build\bin\Release\llama-server.exe ^
  -m "E:\models\Mistral-7B-Q4_K_M.gguf" ^
  --host 0.0.0.0 --port 8081 --device Vulkan0

timeout /t 3 /nobreak >nul

:: Start SD server (Vulkan)
E:
cd "E:\stable-diffusion.cpp\build\bin\Release"
sd-server.exe --listen-ip 0.0.0.0 --listen-port 7860 ^
  -m "E:\models\dreamshaper8.gguf"


Critical rules:

  • taskkill before start: releases VRAM from stuck background processes
  • --host 0.0.0.0: required for Docker to reach the server
  • --device Vulkan0: without this, inference falls back to CPU (3–5 tok/s)
  • Never use .\ before executables in CMD: it breaks the shell
  • Switch drive (E:) before cd: CMD doesn't change drives automatically

Instant validation scripts: run these before building anything.

# Linux / WSL2
./vulkan-diagnostic.sh
:: Windows CMD
vulkan-diagnostic.bat

Expected output:

ggml_vulkan: Found 1 Vulkan device(s)
ggml_vulkan: 0 = AMD Radeon RX 580 2048SP | VRAM: 8192MB  ✅

If your card doesn't appear, the cause is a driver or Vulkan SDK issue. See Master Documentation.

Three independent researchers. Same GPU. Same conclusion: the hardware was never the problem.

| Date | Author | Contribution |
| --- | --- | --- |
| Jan 2025 | 艾米心 Amihart | First documented LLM via Vulkan on RX 580: 24.56 tok/s on Debian. Declared SD via Vulkan "not viable" (a limitation of sd.cpp at that time). |
| Dec 2025 | DH / DadHacks | Refuted Amihart's SD conclusion. Used stable-diffusion.cpp with -DSD_VULKAN=ON and ran FLUX Schnell GGUF generation on RX 580 from the terminal. |
| 2026 | AIVisionsLab | Full Windows production stack: Vulkan LLM + SD + FLUX hybrid + OpenWebUI + Docker networking + Applio RVC + AnimateDiff + whisper.cpp + Linux native benchmarks. |

| Capability | Amihart | DadHacks | AIVisionsLab |
| --- | --- | --- | --- |
| LLM Vulkan | ✅ 24.56 tok/s | ✅ | ✅ 15–35 tok/s |
| SD via Vulkan | ❌ | ✅ CLI | ✅ Server + API |
| FLUX GGUF | ❌ | ✅ CLI | ✅ Hybrid GPU/CPU |
| GUI / OpenWebUI | Docker only | ❌ | ✅ Full integration |
| Windows native | ❌ | ❌ | ✅ |
| Automation scripts | ❌ | ❌ | ✅ .bat double-click |
| Voice cloning | ❌ | ❌ | ✅ Applio RVC |
| Video / AnimateDiff | ❌ | ❌ | ✅ |
| Audio transcription | ❌ | ❌ | ✅ whisper.cpp |
| Linux native (Ubuntu 26.04) | Debian | Debian | ✅ Ubuntu 26.04 LTS |
| GGUF format mapping | ❌ | ❌ | ✅ city96 vs leejet |

The shared technical foundation: ggml and llama.cpp by Georgi Gerganov, and stable-diffusion.cpp by leejet, which builds on ggml. Vulkan compute backends in pure C++ that bypass the entire ROCm/CUDA ecosystem.

Credit: 艾米心 (Amihart), DH (DadHacks), leejet, ggerganov, woodrex, and all independent developers working on hardware preservation and open inference.

Two lab sessions pushed the dual-path stack to its extreme: running a 34.66B-parameter MoE model (Qwen3.5-35B) on the same RX 580 8GB, using llama.cpp's automatic GPU/RAM fitting across 4 memory tiers (VRAM → DDR4 ECC → NVMe → HDD swap).

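Quick links: six focused docs under docs/qwen35-35b/, one question each:

  • Running 35B on 8GB VRAM
  • Benchmarks
  • Thinking mode context overflow
  • OpenWebUI timeout vs server truncation
  • ctx-size and quantization tuning
  • Model reasoning about its own architecture (what <think> traces look like)

Full narrative lab reports (raw logs, timelines, complete test history): docs/qwen35-35b-hybrid-experiment.md (Session 1) and docs/qwen35-35b-proving-hypothesis.md (Session 2).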

Key takeaway: the RX 580 never crashed or throttled across either session (peak 80°C, limit ~90°C). Every failure traced back to software-side timeouts and context-buffer limits, not hardware capacity. With --ctx-size 8192 and Q4_K_M quantization, a 35B MoE model runs stable, full responses included, on a 2017 GPU with system RAM as overflow.
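A back-of-the-envelope estimate shows why the 35B run has to spill beyond VRAM. The sketch assumes an average of about 4.85 bits per weight for Q4_K_M, which is an approximation that varies by model, and it ignores the KV cache and runtime buffers.

```python
PARAMS = 34.66e9        # parameter count quoted above
BITS_PER_WEIGHT = 4.85  # assumed average for Q4_K_M; the real figure varies by model
VRAM_GB = 8.0           # RX 580

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9
spill_gb = max(0.0, weights_gb - VRAM_GB)
print(f"weights ~{weights_gb:.1f} GB, at least {spill_gb:.1f} GB must live outside VRAM")
```

Under these assumptions the weights alone are more than twice the card's VRAM, which is consistent with the GPU/RAM tiering described above.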

GPU from 2017 + CPU from 2014  ──►  34.66B parameters  ──►  5.6–6.6 tok/s

rx580-local-ai-guide/
│
├── scripts/
│   ├── start-ai.bat              # Full stack: all services
│   ├── iniciar_sd_server.bat     # SD 1.5 only
│   ├── iniciar_flux_server.bat   # FLUX hybrid GPU/CPU
│   ├── reboot_stack.bat          # Kill all + restart
│   ├── vulkan-diagnostic.bat     # Vulkan validation (Windows)
│   ├── vulkan-diagnostic.sh      # Vulkan validation (Linux/WSL2)
│   ├── build-llamacpp.sh         # Compile llama.cpp (Linux/WSL2)
│   └── build-sdcpp.sh            # Compile stable-diffusion.cpp
│
├── docs/
│   ├── benchmarks.md             # Real hardware logs, full tables
│   ├── what-failed.md            # DirectML, ROCm, OpenVINO autopsy
│   ├── flux-setup.md             # FLUX hybrid memory architecture
│   ├── firewall-fix.md           # Docker + Windows Firewall
│   ├── wsl2-setup.md             # ComfyUI CPU on WSL2
│   ├── applio-rvc.md             # Voice cloning full guide
│   ├── whisper-cpp.md            # Audio transcription guide
│   ├── linux-ubuntu2604.md       # Ubuntu 26.04 bare-metal guide
│   ├── qwen35-35b-hybrid-experiment.md     # 35B MoE hybrid limit test, full log (Session 1)
│   ├── qwen35-35b-proving-hypothesis.md    # 35B MoE ctx-size/curl proof, full log (Session 2)
│   └── qwen35-35b/                         # Atomic SEO-focused docs, one question per page
│       ├── README.md
│       ├── running-35b-on-8gb-vram.md
│       ├── benchmarks.md
│       ├── thinking-mode-context-overflow.md
│       ├── openwebui-timeout-vs-server-truncation.md
│       ├── ctx-size-and-quantization-tuning.md
│       └── model-reasoning-about-its-own-architecture.md
│
├── vulkan-diagnostic.bat         # Quick validation (root, Windows)
├── vulkan-diagnostic.sh          # Quick validation (root, Linux)
└── README.md

  • llama.cpp: The engine behind LLM inference
  • stable-diffusion.cpp: Image generation in C++
  • whisper.cpp: Audio transcription in C++
  • OpenWebUI: The chat/image interface
  • AIVisionsLab Portal: Full documentation (PT/EN)

This guide is built from real hardware testing, real error logs, and real failures. If you:

  • Got this working on a related card (RX 570, RX 590, RX 5500 XT)
  • Found better build flags or quantization settings for Vulkan
  • Have benchmarks from a different CPU/RAM configuration
  • Fixed a bug in the scripts

Open a PR or issue. Everything here is MIT licensed: use it, fork it, share it.

MIT: do whatever you want with this. Just don't tell people their old GPU is useless.

Built in São Paulo, Brazil 🇧🇷 · Hardware from 2014–2017 · Running SOTA AI in 2026

"The problem was never the GPU."