OpenXLA and JAX - ROCm Support and the State of CI

wpnews.pro

The OpenXLA compiler stack — XLA at the foundation, JAX as the front end — now runs upstream on AMD ROCm. XLA gates every pull request on real AMD Instinct silicon through its GitHub Actions workflow, side by side with the CUDA path; JAX runs the same hardware on every ROCm PR through its own workflows, with the merge gate rolling out next. pip install "jax[rocm7-local]"

is a first-class entry point. This post documents how that backend is structured, what landed in the last twelve months, and how the CI pipeline that keeps it healthy is wired together. ** Part 1** covers OpenXLA on AMD — the XLA backend, what landed this year, and CI.

covers JAX on AMD — the plugin architecture, JAX-side changes, and the four-workflow test matrix.

Part 2

What’s in this post

The OpenXLA stack on ROCm in one diagram.

Supported AMD GPU targets across compiler, default wheels, and CI gating — and why those three lists differ.

The year’s work on XLA and JAX for ROCm: Triton on AMDGPU, hipBLASLt group-GEMM, FP8, hermetic builds, manylinux wheels.

End-to-end CI: how each PR runs on real Instinct hardware and how XLA and JAX cross-pin against each other.

A reproducible three-command quick start.

Where to file issues, send PRs, and dump HLO when a workload misbehaves.

Why JAX/OpenXLA on AMD?# #

JAX on AMD is a deliberate architectural choice. The case for picking it over an eager framework — or over a hand-tuned kernel library — rests on two properties: a whole-program compiler that sees the entire High Level Operations (HLO) graph at lowering time, and a programming model where collectives are inferred from sharding annotations rather than spelled out at the call site.

The technical case. XLA is a hybrid ahead-of-time and just-in-time compiler that consumes a whole HLO graph — often hundreds of ops — and lowers it through fusion and autotuning, dispatching to Triton with LLVM-AMDGPU codegen or to ROCm math libraries (hipBLASLt, MIOpen, and others) according to which provides the best performance. For transformer-shaped pretraining and inference — dense matmuls, attention, layer norms, large collectives — that compile-time view produces fused kernels that would otherwise demand hand-tuning per shape and per generation.

JAX provides the programming model on top: pure functions, composable transforms (jit

, grad

, vmap

, scan

), and SPMD parallelism through pjit

/ shard_map

over GSPMD. Collectives are inserted by the compiler from sharding annotations, so the same source runs on CPUs, TPUs, and GPUs — including our AMD Instinct product line.

Who should keep reading. This post is most directly useful if you are:

An

AMD Instinct customer running pretraining or large fine-tunes, evaluating which compiler stack to standardise on for MI300 / MI350 capacity.A

JAX user adding AMD as a second hardware target without rewriting model code.A

foundation-model lab doing SPMD / GSPMD pretraining and weighing Instinct + RCCL against NVIDIA + NCCL.A

compiler or ML-systems engineer contributing to OpenXLA or JAX on AMD — the CI sections answer “what will happen to my PR before I open it” for both repositories.A

maintainer of MaxText / MaxDiffusion-style reference workloads, or a researcher running scientific simulation on HPC systems (** NumPyro, JAX-MD, Brax**, AlphaFold-shaped models) who wants to target AMD GPUs on leadership-class systems.

Architecture# #

Figure 1 shows the full path from a JAX program to a GPU kernel — five layers, colour-coded by ownership.

Figure 1. The OpenXLA stack on AMD ROCm, from your Python code down to the kernel that HIP launches on the GPU.

Blue is user code; yellow is JAX in Python; green is XLA and its PJRT plugin; red is the ROCm runtime (HIP, math libraries, kernel driver); grey is silicon. The {N}

in jax-rocm{N}-plugin

and jax-rocm{N}-pjrt

is the major ROCm version — today 7

(for ROCm 7.x), resolved at import time. The two AMD-shaped boxes are the only divergence from the CUDA path; everything above and below is shared. That clean factoring is what makes upstream CI tractable, and what the rest of this post is built around.

Quick Start# #

The shortest path from “I have an Instinct GPU” to “I just compiled a JAX program through XLA on it”:

docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri --group-add video \
  --shm-size=64G --ipc=host \
  rocm/jax:latest \
  python -c "import jax, jax.numpy as jnp; \
print(jax.devices()); \
print(jax.jit(lambda x: jnp.tanh(x @ x.T))(jnp.ones((1024,1024), jnp.bfloat16)).shape)"

A RocmDevice

in jax.devices()

and (1024, 1024)

on stdout means that a tanh(x @ x.T)

HLO graph was compiled through Triton plus the AMDGPU LLVM backend, dispatched through HIP, and returned a result. Full install paths — pip and source — are in Try It Out and Get Involved.

Hardware Support# #

Three lists matter, in order of narrowing scope: what the compiler recognises, what the default wheels build for, and what upstream CI exercises on every PR.

Compiler-Supported Architectures#

The authoritative list is kSupportedGfxVersions[]

in xla/stream_executor/rocm/rocm_compute_capability.h. XLA’s ROCm backend recognises and emits code for:

Product | Architecture | Compiler target | |---|---|---| Instinct MI200 series (MI210X / MI250X) | CDNA 2 | | Instinct MI300 series (MI300X / MI325X) | CDNA 3 | | Instinct MI350 series (MI350X / MI355X) | CDNA 4 | | Radeon RX 6800 / 6900 | RDNA 2 | | Radeon RX 7900 | RDNA 3 | | Radeon RX 7700 / 7800 | RDNA 3 | | Phoenix | RDNA 3 (APU) | | Strix Point | RDNA 3.5 (APU) | | Strix Halo | RDNA 3.5 (APU) | | Radeon RX 9000 series | RDNA 4 | |

Default Wheel Build Targets#

JAX’s published ROCm wheels are compiled for the actively supported subset of the compiler list above. From jax/build/rocm/rocm.bazelrc:

gfx908, gfx90a, gfx942, gfx950, gfx1030, gfx1100, gfx1101, gfx1200, gfx1201

The compiler list above stays broader than the wheel list on purpose: if your target isn’t in the default wheels, python build/build.py --rocm_amdgpu_targets=…

lets you build wheels for it locally — including older targets like gfx900

/ gfx906

.

CI-Gated Targets#

We gate each PR before merging by ensuring it runs on our top-of-the-line MI Instinct and Radeon hardware:

XLA upstream—rocm_ci.yml

gfx950

(MI350) on the single-GPU pool, covering theci_single_gpu

configuration.JAX upstream— AMD-hostedbazel_rocm.yml

/pytest_rocm.yml

linux-x86-64-{1,4,8}gpu-amd

pools spanning MI300 in CPX mode (gfx942

) for the Bazel/RBE jobs and MI350 (gfx950

) for the PyTest jobs.

What this means in practice.MI200, MI300, and MI350 Instinct, and RX 6800-class or newer Radeon: the published wheels should work out of the box. Vega (gfx900

/gfx906

) or an RDNA 3 APU: the compiler still supports you, but expect to build wheels with an explicit--rocm_amdgpu_targets

. Upstream CI density today is concentrated ongfx950

(MI350); other Instinct and Radeon targets are exercised in AMD’s downstream CI.

Part 1 — OpenXLA on AMD# #

Where the AMD Code Lives#

OpenXLA compiles HLO into fused, hardware-specific kernels. The GPU pipeline shares a large surface — HLO optimization, fusion, autotuning, command-buffer scheduling — and forks at codegen time into vendor-specific backends.

For AMD, that backend lives across two main areas of the openxla/xla tree:

xla/stream_executor/rocm/— the runtime layer: device discovery, streams, events, command buffers, memory, and wrappers around hipBLASLt, hipFFT, hipSOLVER, hipSPARSE, rocPRIM, MIOpen, and RCCL.xla/service/gpu/— the AMD compiler entry points, most notablyamdgpu_compiler.cc,custom_kernel_emitter_rocm.cc, and the LLVM AMDGPU backend atxla/service/gpu/llvm_gpu_backend/amdgpu_backend.cc.

The build glue that makes ROCm a hermetic, reproducible target lives under third_party/gpus/, with auto-detection in third_party/gpus/rocm_configure.bzl and third_party/gpus/find_rocm_config.py.

What Landed for ROCm in the Past Year#

Over the last twelve months, more than 300 commits on main

touched ROCm, HIP, AMDGPU codegen, or gfx9*

targets. Four themes account for most of the work.

Triton on AMDGPU. Triton is the primary code generator for fused matmul-plus-epilogue patterns in XLA:GPU, and bringing it to production parity on AMD GPUs took a sustained multi-PR effort. An AMD-specific shared-memory allocation pass in the Triton pipeline and a CDNA-aware waves_per_eu knob in the GEMM autotuner closed the gap on per-shape kernel quality;

scaled-dot loweringbrought microscaling MX-format GEMMs to AMD parity; and the early refactors in the Triton AllReduce series (

1,

2) laid the groundwork for Triton-fused collectives on ROCm.

GEMM and FP8 on Instinct. The headline landing was the five-part hipBLASLt group-GEMM enablement — production group-GEMM through

hipBLASLt

is now the default path on Instinct. FP8 is now declared a first-class ROCm 7 capabilitywith fast accumulation, and the compiler-side support check was

relaxed accordingly. End-to-end test coverage was extended to cover

both OCP and NANOO FP8 collective ops, plus the

for group-GEMM.

gfx950

HIP backend requirementsCollectives and rocPRIM. Hand-rolled fallbacks were replaced with tuned ROCm library primitives where they existed — most visibly rocprim::segmented_inclusive_scan in the batched row-scan path. Native ROCm collectives also landed in

: a full

xla/stream_executor/rocm/

all_reduce_kernel_rocm.cc

, multi-GPU barrier, and ragged all-to-all kernels that bring AMD off the CUDA-shim path for these operations.Build, hermeticity, and runtime hygiene. The hermetic LLVM toolchain is the largest gain — XLA’s ROCm build no longer depends on the host’s clang

version, which was the single biggest reproducibility hazard for downstream packagers. Other changes in the same vein streamlined the Bazel targets for ROCm libraries, made LoadKernel

use a ref-counted module path so cleanup is correct, propagated proper error status through the ROCm profiler, and fixed a subtle leading-comma bug in the AMDGPU feature string passed to LLVM.

Net effect.The JAX and XLA ROCm plugins are now at feature parity with the rest of the backends, and deliver strong performance on AMD Instinct GPUs for bf16 and FP8 transformer training and inference, large-scale collectives, and Triton-fused GEMM epilogues.

How ROCm Gets Tested in `openxla/xla`

#

A backend without CI is a backend that suffers from bit rot. The defining ROCm investment in openxla/xla

over the past year has been the unification of ROCm CI into a single upstream GitHub Actions workflow (PR #36893), driven from .github/workflows/rocm_ci.yml. Every PR against main

now runs through it on real AMD silicon before it can be merged.

The Workflow at a Glance#

Job | Runner Label | AMD Product | Coverage | |---|---|---|---| | | MI350 ( | JAX unit tests built against the PR’s XLA, single-GPU | | | MI350 ( | XLA’s own test suite under the |

Both jobs run inside the rocm/tensorflow-build:latest-jammy-pythonall-rocm7.2.1-ci_official

container, pinned by SHA digest for supply-chain hygiene. /dev/kfd

and /dev/dri

are mapped through, an 80 GiB tmpfs Bazel cache is mounted, and the video

group is added so HIP can reach the kernel driver. rocminfo

is invoked early in the run so a bad host fails the first step rather than burying the error in test logs.

Build-System Plumbing#

The CI is driven by Bazel --config

flags defined in build_tools/rocm/rocm_xla.bazelrc:

--config=rocm_rbe

— Remote Build Execution, parallelising build and test actions across many remote workers.--config=rocm_rbe_dynamic

— hybrid mode that builds locally but lets test actions schedule across local and remote, so a single PR can saturate both the on-prem GPU pool and the build farm.--config=ci_single_gpu

— wraps tests inbuild_tools/rocm/parallel_gpu_execute.shso multiple test shards can share the GPU safely, plus three flaky-test retries.

Test Selection#

Not every XLA test is meaningful on AMD GPUs — some are specific to other hardware platforms. The ROCm CI filters in two layers:

excludes roughly fifteen vendor-specific Bazel tags (rocm_tag_filters.sh

cuda-only

,requires-gpu-sm

, Intel-GPU, and similar) so test discovery stays tractable.The

test:xla_sgpu

list inrocm_xla.bazelrc

enumerates the exact targets the single-GPU pool runs, via explicit excludes.

The XLA job pulls execute_ci_build_upstream.sh from AMD’s

ROCm/xla

fork at workflow time. That gives the AMD CI team a fast iteration path on the runner-side script (test selection, failure triage, log post-processing) without round-tripping through openxla/xla

for every change. The workflow file, the Bazel configs, and the test target lists remain upstream and reviewable.## Part 2 — JAX on AMD#

JAX uses XLA as its compiler, but the ROCm story is not just “inherit XLA’s backend”. JAX ships a separate plugin, separate wheels, and runs its own four-workflow CI.

How JAX Loads the ROCm Plugin#

Figure 2 traces the path JAX walks on import jax

, ending at a registered RocmDevice

:

Figure 2. The JAX ROCm plugin path — from import jax down to a registered RocmDevice.

Yellow boxes run in the Python interpreter; the green box is the native shared library compiled into the PJRT wheel; the dashed grey box is the bundled fallback used only if neither dedicated plugin is installed. The probes jax_rocm7_plugin

on import, picking up the ROCm 7 plugin automatically when present.

The relevant code lives under:

jax_plugins/rocm/— the Python plugin entry point that registers ROCm withxla_bridge

.jaxlib/rocm/— the native plugin extension (rocm_plugin_extension.cc

) that exposes ROCm-specific FFI types and custom-call handlers across the C ABI.rocm/rocm-jax— AMD’s infrastructure repo, with the Dockerfiles and tooling used to build and ship therocm/jax

images for each ROCm version.

At install time, ROCm support ships as two separate wheels:

Wheel | Contents | |---|---| | The native PJRT C-API plugin ( | | The Python wrapper that JAX’s |

{N}

is the major ROCm version (today 7

). The user-facing install instructions live in docs/installation.md; the Dockerfile-based path lives in rocm/rocm-jax; and a prebuilt image is published as rocm/jax:latest

.

Why two wheels?The split lets AMD ship post-release fixes (.postN

bumps) on the PJRT wheel without forcing a JAX version bump, and lets you co-install multiple ROCm-major-version plugins on the same host without conflicts.

What Landed in ROCm for the Past Year#

The JAX-side work has been similarly active over the past year.

Correctness. AMD contributors landed a Pallas inter-block write race fix for non-range while-loops — a real kernel synchronization bug on ROCm — and added two targeted skips where hipSolver’s semantics diverge from cuSolver: complex paths in testEighIdentity and the

.

tridiagonal_solve_perturbed

path inside eigh

Test infrastructure. ROCm pytest was split TPU-style into single- and multi-accelerator passes (with follow-up parallelization in commit 663efe75a

); each pytest-xdist worker now gets its own GPU through a per-worker HIP_VISIBLE_DEVICES override gated by

JAX_ENABLE_ROCM_XDIST

; and the ROCm build wired up so JAX’s ROCm CI can pin against XLA HEAD instead of JAX’s own XLA pin.

clone_main_xla

plumbingWheels and packaging. ROCm wheels moved off direct S3 to a CloudFront-backed CDN, auditwheel

was taught to accept manylinux_2_28 — opening the door to install on a much wider set of Linux distributions out of the box — and

to track the ROCm-side updates.

rules_ml_toolchain

was bumpedWorkflow hygiene. The ROCm jobs in bazel_rocm.yml and the wheel-download composite action carry explicit

zizmor

overrides where the linter’s defaults conflicted with what the ROCm pipeline actually needs to do.The pattern is consistent: correctness fixes, production-grade packaging, and CI plumbing that lets ROCm-side and XLA-side changes ride the same trains as everything else. ROCm is being maintained as a first-class target, not a side branch.

How ROCm Gets Tested in `jax-ml/jax`

#

JAX runs four ROCm GitHub Actions workflows:

Workflow | Purpose | Hardware | |---|---|---| Full Bazel test sweep on RBE | 1- and 4-GPU AMD pools | | Lightweight presubmit gate | Single AMD GPU | | Python-level pytest with multi-accelerator separation | 1- / 4- / 8-GPU AMD | | Builds the | manylinux_2_28 builder |

The runner-side scripts live in jax/ci/:

run_bazel_test_rocm_rbe.sh— the Bazel-RBE entry point. HonorsJAXCI_CLONE_MAIN_XLA=1

to swap in an XLA-HEAD checkout via--override_repository=xla=…

, which is how OpenXLA PRs pre-flight against JAX before merge.run_pytest_rocm.sh— the pytest entry point. Single-accelerator tests run under pytest-xdist withJAX_ENABLE_ROCM_XDIST

set to the GPU count; multi-accelerator tests (-m "multiaccelerator"

) run serially with the full GPU set.build_rocm_artifacts.sh— drivespython build/build.py --wheels=jax-rocm-plugin,jax-rocm-pjrt

and runsauditwheel

for manylinux compliance.upload_rocm_logs.sh— ships test logs to S3/CloudFront for triage.

Containers used:

ghcr.io/rocm/jax-dev-ubu24.rocm720:latest

for Bazel test workflows.ghcr.io/rocm/jax-base-ubu24.rocm720:latest

for pytest workflows (runtime-trimmed image).ghcr.io/rocm/jax-manylinux_2_28-rocm-7.2.0:latest

for wheel building.

Default coverage in upstream CI today: ROCm 7.2.x; Python 3.11 through 3.14; MI350 (gfx950

) for pytest_rocm.yml

and MI300 in CPX mode (gfx942

) for bazel_rocm.yml

. Other Instinct generations (MI200 gfx90a

) and RDNA Radeon targets are exercised in AMD’s downstream CI; upstream coverage expands as runner capacity comes online.

The xdist isolation pattern.The[hook pins each xdist worker to a single physical GPU by setting both]conftest.py

ROCR_VISIBLE_DEVICES

(so ROCr enumerates only that GPU) andHIP_VISIBLE_DEVICES=0

(so HIP doesn’t re-enumerate hidden agents). Without that pairing, multi-process pytest either crashes on contention or silently colocates workers on device 0. Worth borrowing for any multi-process ROCm test harness.

The Integrated CI Pipeline# #

XLA and JAX CI are not independent systems. XLA pre-flights every PR through JAX; JAX can pin against XLA HEAD via Bazel’s --override_repository

. Two workflow cadences ride on top of that coupling:

Nightly— JAX HEAD built and tested against the XLA commit pinned in JAX’sWORKSPACE

. This is the day-to-day regression signal for JAX itself.Continuous— JAX HEAD built and tested against XLA HEAD (JAXCI_CLONE_MAIN_XLA=1

overrides the pin). This is what catches XLA regressions in the window between XLA-pin bumps.

A ToT ROCm axis is being rolled into the same matrix shortly, adding the ROCm release in the container as a third moving piece exercised against JAX HEAD.

The result is one cross-repo pipeline (Figure 3).

Figure 3. How a ROCm PR flows through XLA CI, JAX CI, and shared infrastructure to land on AMD Instinct runners. Dotted arrows are the cross-repo integration edges (XLA pre-flighting JAX, and JAX pinning XLA HEAD).

Blue is a PR trigger; green is XLA CI; yellow is JAX CI; red is shared build and test infrastructure; grey is the physical AMD Instinct runner pool. Solid arrows are intra-workflow control flow. The two dotted arrows are what make this one pipeline instead of two:

XLA → JAX pre-flight. A PR againstopenxla/xla

triggers thejax

job in, which checks out.github/workflows/rocm_ci.yml

jax-ml/jax

and runs JAX’s ownwithrun_bazel_test_rocm_rbe.sh

--override_repository=xla=$GITHUB_WORKSPACE

. An XLA change that would silently break JAX gets a red check before merge.JAX → XLA HEAD pin. SettingJAXCI_CLONE_MAIN_XLA=1

makes the same script clone the latest XLAmain

and override the repo, so nightly JAX runs catch XLA regressions in the window between XLA-pin bumps in JAX’sWORKSPACE

.

Both directions terminate at the same backing systems: a third-party RBE cluster for build and test scheduling, and the AMD Instinct runner pools for actual GPU execution.

Try It Out and Get Involved# #

The fastest path from this post to a JIT-compiled JAX program on Instinct hardware. Pick the entry point that matches your environment.

Path 1 — Docker (Lowest Friction)#

With an AMD Instinct GPU and a host running ROCm-capable kernel modules, the prebuilt JAX-on-ROCm image is the shortest path:

docker pull rocm/jax:latest

docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --shm-size=64G \
  --ipc=host --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  rocm/jax:latest

Inside the container:

import jax
import jax.numpy as jnp

print(jax.devices())   # should list ROCm devices

@jax.jit
def f(x):
    return jnp.tanh(x @ x.T)

x = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
print(f(x).block_until_ready().shape)

A RocmDevice

in jax.devices()

confirms that the plugin loaded and that XLA is compiling through the AMDGPU LLVM backend.

Path 2 — pip on a Host with ROCm Installed#

For an existing ROCm 7 install on the host:

pip install --upgrade "jax[rocm7-local]"

This pulls jax

, jax-rocm7-plugin

, and the matching jax-rocm7-pjrt

wheel from PyPI. JAX does not install the ROCm toolkit itself — install the runtime first via the ROCm installation guide. Post-release fixes ship as jax-rocm7-plugin==X.Y.Z.postN

and can be upgraded independently of the JAX version.

Path 3 — Building XLA from Source#

For compiler-side work rather than running workloads:

git clone https://github.com/openxla/xla.git
cd xla
./configure.py --backend=ROCM --rocm_path=/opt/rocm
bazel test --config=rocm //xla/...

The same --config=rocm_rbe

and --config=ci_single_gpu

options that upstream CI uses are available locally; see build_tools/rocm/rocm_xla.bazelrc.

Where to Go from Here#

If you want to… | Start here | |---|---| Read installation specifics | | Understand the XLA build | | Look up ROCm itself | | Watch CI status | | File an XLA bug | | File a JAX bug | | Send a PR | | See AMD’s staging branches |

The highest-leverage contributions from outside AMD, in our experience:

Performance reports with HLO dumps. The dump flags and tooling are documented inanddocs/hlo_dumps.md

. A reproducible HLO module turns “this is slow” into a tractable issue.docs/tools.md

Numerical-divergence reports. A workload that runs cleanly on CUDA but produces different numerics on ROCm is exactly the kind of signal AMD reviewers prioritise — open an issue with a small reproducer.If you run a target outside the default CI matrix (RDNA in particular), reports of what works and what doesn’t directly inform the next CI expansion.gfx

coverage on the long tail.

Every ROCm-touching PR against either repo runs through the workflows above and returns real-hardware results within a couple of hours. That feedback loop is the entire point.

Summary# #

In the past twelve months, AMD contributors and the broader OpenXLA / JAX community landed:

Triton on AMDGPU at feature parity for matmul, scaled-dot, and the AllReduce groundwork.hipBLASLt group-GEMM,** FP8 fast accumulationon ROCm 7, and rocPRIMintegration for batched scans in XLA. Hermetic LLVM**in the XLA build, cleanhipcc

toolchain ordering, andmanylinux_2_28 wheels for JAX.One unified upstream ROCm CI workflow for XLA, and afour-workflow matrix for JAX with single- and multi-accelerator separation and proper xdist isolation.JAX ↔ XLA cross-pin, so changes on either side pre-flight against the other on real Instinct GPUs before merge.

The roadmap ahead:

Triton AllReduce. Complete the four-part series and turn on Triton-fused collectives in production XLA pipelines.Broaden CI coverage past the currentgfx942

,gfx950

, and beyond.gfx950

single-GPU pool to next-generation Instinct parts as they come online, on both XLA and JAX runners.Autotuning corpora for AMD. Extend the persisted-autotuning and collective-perf-table machinery (additions inPR #40653) with AMD-specific tuning data shipped alongside ROCm releases.Multi-host CI. Extend coverage from intra-node 4-GPU collectives to multi-host runs so distributed JAX and XLA workflows are validated end to end.WSL2 graduation. Move JAX-on-WSL2 from experimental () to a tested CI lane.docs/installation.md

In this blog we covered the full ROCm story for OpenXLA and JAX: the backend architecture, a year of upstream contributions, and the CI infrastructure that gates every PR on real Instinct hardware. ROCm is now a first-class OpenXLA target — upstream, gated on real hardware, and visible in every PR. The quick start at the top of this post is the shortest path from here to a JIT-compiled JAX program on an AMD Instinct GPU, and to filing the next bug or PR that moves the stack forward.

Disclaimers# #

Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.

source & further reading

rocm.blogs.amd.com — original article Accelerating LLM Inference on AMD GPUs with Low-Latency GEMMs Efficient GPU Utilization With Workload Pre-Emption in AMD Resource Manager DP Attention and TBO for DeepSeek-V4 on MI355X