A Dummy's Guide to AMD GPU Issues on Linux - Understanding RDNA3, TLB Fences, and Kernel Parameters

This article is a beginner-friendly guide to diagnosing and fixing AMD GPU crashes, freezes, and instability on Linux, particularly for RDNA3 (RX 7000 series) cards. It explains common symptoms like system freezes and "fence timeout" errors, identifies a known kernel bug in versions 6.14-6.17 as a primary cause, and provides troubleshooting steps including specific kernel parameters and diagnostic commands.

- Common Symptoms
- Understanding the Terminology
- Diagnosing Your GPU Issues
- Common AMD GPU Problems
- Kernel Parameters Explained
- Step-by-Step Troubleshooting

Common Symptoms

You might be experiencing AMD GPU issues if you see:

- System freezes/crashes randomly
- Black screens
- "GPU hung" or "fence timeout" messages in logs
- Display flickering or artifacts
- Messages about "overdrive" or "power management"
- Applications crash when using GPU acceleration

Understanding the Terminology

RDNA/RDNA2/RDNA3: AMD's GPU architecture generations
- RDNA: RX 5000 series (e.g., RX 5700 XT)
- RDNA2: RX 6000 series (e.g., RX 6800 XT)
- RDNA3: RX 7000 series (e.g., RX 7900 XTX, RX 7700 XT)

Navi 10/21/23/31/32/33: Code names for specific GPU chips
- Navi 31 = RX 7900 XTX / XT
- Navi 32 = RX 7700 XT / 7800 XT
- Navi 33 = RX 7600

GFX Version: Internal GPU identifier (e.g., gfx1101 for Navi 32, an RDNA3 chip)

DMA (Direct Memory Access): How the GPU accesses system memory without involving the CPU. Think of it as a direct highway between GPU and RAM.

TLB (Translation Lookaside Buffer): A cache of recent memory address translations. Like a phone book for memory locations.

Fence Timeout: When the GPU promises to finish a task by a deadline but fails to do so. The system waits... and waits... and eventually gives up, causing a crash.

TLB Fence Timeout: The specific problem where the GPU can't complete memory translation tasks in time.
This is a known bug affecting RDNA3 GPUs on certain Linux kernels.

IOMMU (Input-Output Memory Management Unit): Hardware that manages memory access for devices. Sometimes causes conflicts with AMD GPUs.

SMU (System Management Unit): Firmware that controls GPU power, clocks, and thermal management.

Power DPM (Dynamic Power Management): System that adjusts GPU clock speeds and voltage based on workload.

Overdrive: AMD's term for its overclocking features. An "overdrive enabled" message usually just means those controls are available, not that the card has actually been overclocked.

AMDGPU: The open-source Linux kernel driver for modern AMD GPUs

ROCm: AMD's compute platform for GPU computing (like CUDA for Nvidia)

Mesa: The open-source graphics stack that implements OpenGL, Vulkan, etc.

Firmware: Low-level software that runs on the GPU itself

Diagnosing Your GPU Issues

    # Check kernel messages for GPU errors
    sudo dmesg | grep -i "amdgpu\|gpu\|fence\|timeout" | tail -50

    # Check system logs
    sudo journalctl -b -0 --no-pager | grep -i "amdgpu\|gpu hung\|fence" | tail -50

TLB Fence Issues (RDNA3 specific):

    amdgpu_tlb_fence_work
    dma_fence_wait_timeout
    Trying to push to a killed entity

→ This is a kernel bug, not a hardware problem

Power Management Issues:

    amdgpu: GPU recovery enabled
    runtime_pm
    gfx_off

→ Power features causing instability

Firmware Mismatches:

    SMU driver if version not matched

→ Driver and firmware versions don't match

Display Issues:

    DC (Display Core)
    DMUB (Display Microcontroller)

→ Display subsystem problems

    # Get GPU info
    lspci | grep -i vga

    # ROCm info (if installed)
    rocm-smi --showproductname

    # Check GFX version
    grep "GFX Version" /var/log/Xorg.0.log

Common AMD GPU Problems

Problem: GPU freezes, system hangs, "fence timeout" in logs
Affected: Mainly RX 7000 series on kernels 6.14-6.17
Cause: Kernel bug in memory management
Solution: Kernel parameters (see below)

Problem: System freezes during idle or wake from sleep
Affected: Most AMD GPUs
Cause: Aggressive power saving features
Solution: Disable runtime PM and GFX off

Problem: Random black screens, flickering
Affected: Multi-monitor setups, high refresh rate
Cause: Display Core (DC) bugs
Solution: DC-specific kernel parameters

Problem: GPU performance drops, thermal warnings
Affected: All GPUs with inadequate cooling
Cause: Poor airflow, dust, faulty firmware
Solution: Physical cleaning, firmware update, custom fan curves

Kernel Parameters Explained

Kernel parameters are settings you pass to the Linux kernel at boot. On GRUB-based systems they're added in /etc/default/grub, in the GRUB_CMDLINE_LINUX_DEFAULT line.

amdgpu.tmz=0
- What: Disables Trusted Memory Zone
- Why: TMZ has bugs on RDNA3 that can cause freezes
- When to use: RDNA3 GPUs with random crashes

amdgpu.sg_display=0
- What: Disables scatter-gather for display
- Why: Reduces DMA fence timeouts
- When to use: Display issues, TLB fence timeouts

amdgpu.dcdebugmask=0x10
- What: Sets a Display Core debug flag; the 0x10 bit disables Panel Self Refresh (PSR)
- Why: PSR can cause display hangs on some setups
- When to use: Display-related freezes

iommu=soft
- What: Uses software bounce buffering (SWIOTLB) instead of the hardware IOMMU
- Why: Hardware IOMMU can conflict with AMD GPUs
- When to use: DMA/fence timeout issues

amdgpu.gpu_recovery=1
- What: Enables automatic GPU recovery after hangs
- Why: GPU can reset itself instead of crashing the system
- When to use: Always recommended

amdgpu.gfx_off=0
- What: Disables GFX power gating
- Why: The GFX off state causes crashes on some GPUs
- When to use: Idle crashes, wake-from-sleep issues
- Check first: confirm that modinfo amdgpu lists gfx_off on your kernel; a parameter the driver does not recognize is ignored

amdgpu.runpm=0
- What: Disables runtime power management (some guides write amdgpu.runtime_pm=0; runpm is the name the driver documents)
- Why: Runtime PM causes suspend/resume crashes
- When to use: Sleep/wake issues

amdgpu.ppfeaturemask=0xffffffff
- What: Enables all power play features, including overdrive
- Why: Sometimes disabling features causes more problems
- When to use: When conservative settings don't work

amdgpu.dc=0
- What: Disables Display Core (uses legacy display code)
- Why: DC has bugs, legacy is more stable
- When to use: Last resort for display issues (loses features). Newer GPUs, including RDNA, may have no legacy display path, so disabling DC there can leave you without display output

Edit /etc/default/grub:

    # For RDNA3 TLB fence issues
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.tmz=0 amdgpu.sg_display=0 amdgpu.dcdebugmask=0x10 iommu=soft"

    # For general stability
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.gpu_recovery=1 amdgpu.gfx_off=0 amdgpu.runpm=0"

After editing, update GRUB:

    sudo update-grub

Step-by-Step Troubleshooting

- Identify your GPU: lspci | grep VGA
- Check kernel version: uname -r
- Check for errors in logs:

      sudo dmesg | grep -i amdgpu | tail -50
      sudo journalctl -b -0 | grep -i "fence\|timeout" | tail -20

- Check driver version: modinfo amdgpu | grep version
- Check firmware version: sudo dmesg | grep "smu fw version"
- Update everything:

      sudo apt update && sudo apt upgrade
      sudo apt install linux-firmware

- Try a different kernel:
  - Reboot and select an older kernel from the GRUB menu
  - For RDNA3: kernel 6.11.x is often more stable than 6.14+
- Add basic stability parameters:

      sudo nano /etc/default/grub
      # Add to GRUB_CMDLINE_LINUX_DEFAULT: amdgpu.gpu_recovery=1 amdgpu.gfx_off=0
      sudo update-grub
      sudo reboot

- For TLB Fence Timeouts (RDNA3): Add these parameters:

      amdgpu.tmz=0 amdgpu.sg_display=0 amdgpu.dcdebugmask=0x10 iommu=soft

- For Power Management Issues: Add these parameters:

      amdgpu.runpm=0 amdgpu.gfx_off=0

- For Display Issues: Try these one at a time:

      amdgpu.dcdebugmask=0x10
      amdgpu.dc=0    # Last resort - loses features

- Create a modprobe config (alternative to kernel parameters):

      sudo nano /etc/modprobe.d/amdgpu.conf

  Add:

      options amdgpu gpu_recovery=1
      options amdgpu gfx_off=0
      options amdgpu tmz=0

  Then:

      sudo update-initramfs -u
      sudo reboot

If nothing else works:

- Try the proprietary driver (AMDGPU-PRO):
  - Not recommended for gaming
  - Better for compute workloads
  - Download from the AMD website
- Downgrade to an older kernel:

      # Install older kernel
      sudo apt install linux-image-6.11.0-8-generic
      # Boot into it from the GRUB menu

- File a bug report:
  - Check existing bugs: https://gitlab.freedesktop.org/drm/amd/-/issues
  - Include: dmesg output, GPU model, kernel version, reproduction steps

    # See active kernel parameters
    cat /proc/cmdline

    # Check specific amdgpu parameter
    cat /sys/module/amdgpu/parameters/gpu_recovery

    # Watch GPU clocks and temperature (ROCm)
    watch -n 1 rocm-smi

    # Check power management state
    cat /sys/class/drm/card0/device/power_dpm_force_performance_level

    # Check current GPU clocks
    cat /sys/class/drm/card0/device/pp_dpm_sclk
    cat /sys/class/drm/card0/device/pp_dpm_mclk

    # OpenGL stress test
    glxgears -fullscreen

    # Vulkan stress test
    vkcube

    # Temperature, power, and clocks under compute load (if ROCm installed)
    rocm-smi --showtemp --showpower --showclocks

Myth: "Overdrive causes crashes"
Reality: Overdrive is AMD's term for its overclocking features. The message only reports that they are available and is usually harmless.

Myth: "AMD GPUs don't work on Linux"
Reality: They work well; RDNA3 just has some kernel bugs that are being fixed.

Myth: "You need proprietary drivers"
Reality: The open-source AMDGPU driver is excellent and recommended.

Myth: "Lowering clocks fixes stability"
Reality: Usually doesn't help. Most issues are driver/kernel bugs, not hardware limits.

Myth: "More power management = better"
Reality: Aggressive power saving often causes more crashes than it's worth.

- Check logs first - most diagnosis comes down to reading dmesg/journalctl
- Search existing issues - Your problem is probably known
- Provide details:
  - GPU model (exact SKU)
  - Kernel version
  - Driver version
  - Full dmesg output showing the error
  - Steps to reproduce

- AMD GPU Linux Kernel Driver: https://gitlab.freedesktop.org/drm/amd
- Mesa Graphics: https://gitlab.freedesktop.org/mesa/mesa
- ROCm: https://github.com/RadeonOpenCompute/ROCm
- Arch Wiki (excellent resource): https://wiki.archlinux.org/title/AMDGPU
- Ubuntu AMD GPU Guide: https://help.ubuntu.com/community/RadeonDriver

Note: This guide is based on real-world troubleshooting of RDNA3 GPU issues on Linux. Always back up your system before making changes, and remember that kernel/driver bugs get fixed over time - sometimes just waiting for updates is the best solution.

Disclaimer: Information provided is for educational purposes. The author is not responsible for any system instability or data loss.
Always maintain backups and test changes carefully.

Generated by Claude Code - Verify all technical information before applying to production systems.
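
The single-parameter check from this guide (reading a file under /sys/module/amdgpu/parameters) can be extended to several parameters at once, which helps confirm that the options you added at boot were actually picked up. The following is a minimal sketch, not a definitive tool: check_amdgpu_params is a hypothetical helper name, and it assumes the loaded driver exposes its parameters as readable files in the directory you pass in. A name reported as "not found" may be misspelled or may not exist in your kernel's driver.

```shell
#!/bin/sh
# check_amdgpu_params (hypothetical helper): print the current value of each
# requested module parameter, or "not found" if the driver does not expose it.
#   $1   = parameters directory (assumed: /sys/module/amdgpu/parameters)
#   rest = parameter names without the "amdgpu." prefix
check_amdgpu_params() {
    param_dir=$1
    shift
    missing=0
    for name in "$@"; do
        if [ -r "$param_dir/$name" ]; then
            # Each parameter is a small text file holding its current value.
            printf '%s=%s\n' "$name" "$(cat "$param_dir/$name")"
        else
            printf '%s: not found\n' "$name"
            missing=1
        fi
    done
    # Return non-zero if at least one parameter was missing.
    return "$missing"
}
```

On a running system it could be called as check_amdgpu_params /sys/module/amdgpu/parameters gpu_recovery tmz sg_display runpm. Taking the directory as an argument keeps the function easy to try against a scratch directory before pointing it at the real sysfs path.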