{"slug": "amd-strix-halo-rdma-cluster-setup-guide", "title": "AMD Strix Halo RDMA Cluster Setup Guide", "summary": "AMD Strix Halo cluster setup guide details how to configure a two-node system linked via Intel E810 RoCE v2 for distributed vLLM inference using Tensor Parallelism. The guide covers hardware prerequisites, host configuration on Fedora 43, and running the cluster with Ray and RCCL. The setup enables low-latency communication between nodes, making them behave like a single machine for large model inference.", "body_md": "This guide details how to configure a two-node **AMD Strix Halo** cluster linked via **Intel E810 (RoCE v2)** for distributed vLLM inference using Tensor Parallelism.\n\n- [TL;DR (Quick Start)](#1-tldr-quick-start)\n- [Concepts & Architecture](#2-concepts--architecture)\n- [Hardware Prerequisites](#3-hardware-prerequisites)\n- [Host Configuration (Fedora)](#4-host-configuration-fedora)\n- [Toolbox Installation & Network Verification](#5-toolbox-installation--network-verification)\n- [Running the Cluster](#6-running-the-cluster)\n- [Troubleshooting](#7-troubleshooting)\n- [References & Acknowledgements](#8-references--acknowledgements)\n\n## 1. TL;DR (Quick Start)\n\n**On Both Nodes:**\n\n- **Preparation**: Install/Update **Fedora 43** and the E810 NICs (check firmware: `ethtool -i <iface>`).\n- **BIOS/Kernel**: Set iGPU to 512MB and apply kernel params (`iommu=pt`, `pci=realloc`, etc.).\n- **SSH**: Configure **passwordless SSH** between nodes.\n- **Networking**: Assign static IPs (`192.168.100.1` & `.2`), set MTU 9000, and trust the interface in the firewall.\n- **Install Toolbox**: Run `./refresh_toolbox.sh` (this automatically installs the container with RDMA support and the custom `librccl.so` patch).\n- **Run Cluster**:\n  - Run `start-vllm-cluster`.\n  - Select **\"2. Start Ray Cluster\"** (follow prompts using the TUI).\n  - Select **\"4. Launch VLLM Serve\"** and choose your model. 
(Export `HF_TOKEN` first for gated models!)\n\n**Key Note**: The `refresh_toolbox.sh` script detects your InfiniBand/RDMA devices and automatically configures the container to expose them.\n\n## 2. Concepts & Architecture\n\nTo fully utilize the Strix Halo cluster, it is helpful to understand the technologies involved:\n\n- **vLLM**: A high-performance inference engine. To run models larger than a single GPU (or APU) can handle, it splits the model using **Tensor Parallelism (TP)**.\n- **Ray**: A distributed computing framework. vLLM uses Ray to **orchestrate** the cluster, manage the \"worker\" processes on each node, and ensure they start up correctly. Ray handles the *control plane* (issuing commands).\n- **RCCL (ROCm Collective Communication Library)**: The AMD equivalent of NVIDIA's NCCL. This library handles the **data plane**: the extremely fast synchronization of tensor data between GPUs. When TP=2, the two nodes must exchange partial results after *every single layer* of the neural network. This happens thousands of times per second.\n- **RoCE v2 (RDMA over Converged Ethernet)**: The protocol that allows RCCL to write data directly from one node's memory to the other node's memory, bypassing the CPU and OS kernel.\n  - **Without RDMA**: Latency is ~70-100µs (TCP/IP overhead).\n  - **With RDMA**: Latency is ~5µs.\n  - **Why it matters**: For interactive token generation, high latency kills performance. RoCE makes the two nodes feel like a single machine.\n\n## 3. Hardware Prerequisites\n\n- **Nodes**: 2x [Framework Desktop Mainboards](https://frame.work/gb/en/products/framework-desktop-mainboard-amd-ryzen-ai-max-300-series?v=FRAFMK0006) with AMD Ryzen AI MAX+ \"Strix Halo\", 128GB of Unified Memory.\n- **Network Cards**: [Intel Ethernet Controller E810-CQDA1](https://www.intel.com/content/www/us/en/products/sku/192558/intel-ethernet-network-adapter-e810cqda1/specifications.html) (or similar 100GbE QSFP28).\n- **Connection**: Direct Attach Copper (DAC) cable (e.g., [QSFPTEK 100G QSFP28 DAC](https://www.amazon.co.uk/dp/B09F32F7VK)). 
No switch required for 2 nodes.\n- **PCIe Note**: The Framework motherboard PCIe slot is physically **x4**, so a riser is required to plug in a 16x card (e.g., [CY PCI-E Express 4x to 16x Extender](https://www.amazon.co.uk/dp/B0837FZFJ6)).\n  - **Test Setup Note:** One of the boards in this setup has a modified PCIe slot (cut by Framework using an ultrasonic knife) to accept x16 cards directly. **This is not recommended for users.** Risers are the cheaper, safer, and easier solution. Performance is identical (~50Gbps bandwidth, ~5µs latency).\n\n## 4. Host Configuration (Fedora)\n\nPerform these steps on the **Host OS** (Fedora 43) of **both nodes**.\n\n**Tested Host Configuration:**\n\n| Node | Kernel | OS | IP (RDMA Interface) |\n|---|---|---|---|\n| Node 1 | `6.18.5-200.fc43.x86_64` | Fedora Linux 43 | `192.168.100.1/30` |\n| Node 2 | `6.18.6-200.fc43.x86_64` | Fedora Linux 43 | `192.168.100.2/30` |\n\nNote: These specific kernel versions were verified to work. Fedora 43 is recommended.\n\nInstall the core RDMA userspace tools. You do **not** need proprietary Intel drivers; the in-kernel drivers work perfectly.\n\n- **Ethernet Driver:** `ice`\n- **RDMA Driver:** `irdma` (Unified driver for RoCE v2 & iWARP)\n\n```\nsudo dnf install rdma-core libibverbs-utils perftest\n```\n\n- `rdma-core`: The userspace components for the RDMA subsystem (libraries, daemons, and configuration tools).\n- `libibverbs-utils`: Utilities for querying RDMA devices (e.g., `ibv_devinfo`).\n- `perftest`: A suite of benchmarks (e.g., `ib_write_bw`, `ib_send_lat`) to verify RDMA bandwidth and latency.\n\nUse `ethtool` to check the current firmware version of your Intel E810 card.\n\n```\nethtool -i enp194s0np0\n```\n\n**Recommended Firmware:**\nEnsure your firmware is at least as new as the version shown below (Firmware `4.91...`). 
If your firmware is older, please update it using the [Intel® Ethernet NVM Update Tool for E810 Series](https://www.intel.com/content/www/us/en/download/19624/non-volatile-memory-nvm-update-utility-for-intel-ethernet-network-adapter-e810-series-linux.html).\n\n**Example Output:**\n\n```\ndriver: ice\nversion: 6.18.5-200.fc43.x86_64\nfirmware-version: 4.91 0x800214b5 1.3909.0\nexpansion-rom-version: \nbus-info: 0000:c2:00.0\nsupports-statistics: yes\nsupports-test: yes\nsupports-eeprom-access: yes\nsupports-register-dump: yes\nsupports-priv-flags: yes\n```\n\nThis guide assumes a subnet of `192.168.100.0/30`.\n\n**Identify your interface**:\nRun `ip link` to find your 100GbE card (e.g., `enp194s0np0`).\n\n**Node 1 (Head - 192.168.100.1):**\n\n```\n# Bring link up\nsudo ip link set enp194s0np0 up\n\n# Assign IP\nsudo ip addr add 192.168.100.1/30 dev enp194s0np0\n\n# Set MTU (Jumbo Frames); assumes a NetworkManager connection profile named \"rdma0\" exists for this interface\nsudo nmcli connection modify \"rdma0\" ethernet.mtu 9000\nsudo nmcli connection up \"rdma0\"\n```\n\n**Node 2 (Worker - 192.168.100.2):**\n\n```\n# Bring link up\nsudo ip link set enp194s0np0 up\n\n# Assign IP\nsudo ip addr add 192.168.100.2/30 dev enp194s0np0\n\n# Set MTU (same assumption: an \"rdma0\" connection profile exists)\nsudo nmcli connection modify \"rdma0\" ethernet.mtu 9000\nsudo nmcli connection up \"rdma0\"\n```\n\n**Verify Routing:**\nThe kernel normally adds the connected route when the address is assigned. Check that it exists on both nodes, and add it manually only if it is missing:\n\n```\nip route show dev enp194s0np0 # the 192.168.100.0/30 route should already be listed\nsudo ip route add 192.168.100.0/30 dev enp194s0np0 # only needed if it is missing\n```\n\n**Verify Link:**\n\n```\nrdma link\n# Output should show: state ACTIVE physical_state LINK_UP used_usec X ...\n```\n\n**1. BIOS Settings:**\nSet the **iGPU Memory Allocation** to the **minimum possible (512MB)**. We will use the GTT (Graphics Translation Table) to dynamically allocate system memory as \"Unified Memory\" for the GPU.\n\n**2. 
Kernel Parameters:**\nUpdate GRUB to enable unified memory, optimize RDMA performance, and fix PCI resource allocation.\n\nEdit `/etc/default/grub` and append to `GRUB_CMDLINE_LINUX`:\n\n```\niommu=pt pci=realloc pcie_aspm=off amdgpu.gttsize=126976 ttm.pages_limit=32505856\n```\n\n**Explanation of Parameters:**\n\n- `iommu=pt`: Sets IOMMU to \"Pass-Through\" mode. This is critical for performance, reducing overhead for both the RDMA NIC and the iGPU unified memory access.\n- `pci=realloc`: Reallocates PCI BARs. Often needed on consumer platforms to properly map large address spaces for devices like the E810 or Strix Halo.\n- `pcie_aspm=off`: Disables PCIe Active State Power Management. Prevents latency spikes and link negotiation issues on the 100GbE connection.\n- `amdgpu.gttsize=126976`: Caps the GPU GTT size at 124GiB (126976MiB). This defines how much system RAM the GPU can address as its own \"VRAM\".\n- `ttm.pages_limit=32505856`: Limits the Translation Table Manager to 124GiB (32505856 pages of 4KiB each), matching the GTT size.\n\n**3. Apply Changes:**\n\n```\nsudo grub2-mkconfig -o /boot/grub2/grub.cfg\nsudo reboot\n```\n\nApplications like Ray and RCCL use random high ports. It is easiest to trust the internal RDMA interface completely.\n\n```\n# Assign the interface to the trusted zone permanently\nsudo firewall-cmd --permanent --zone=trusted --add-interface=enp194s0np0\n\n# Reload firewall\nsudo firewall-cmd --reload\n```\n\nThe cluster management and verification scripts rely on SSH to execute commands on remote nodes. You must configure **passwordless SSH** between both nodes (root or sudo-enabled user).\n\n- **Guide:** [How to Set Up SSH Keys on Linux (DigitalOcean)](https://www.digitalocean.com/community/tutorials/how-to-set-up-ssh-keys-on-ubuntu-20-04)\n- **Quick Check:** Run `ssh <other-node-ip> date` from each node. 
It should print the date without asking for a password.\n\n## 5. Toolbox Installation & Network Verification\n\nThe toolbox container provided in this repo includes a **critical patch**: a custom-built `librccl.so` that enables `gfx1151` (Strix Halo) support for RDMA ([https://github.com/kyuz0/rocm-systems/tree/gfx1151-rccl](https://github.com/kyuz0/rocm-systems/tree/gfx1151-rccl)), which is currently missing in upstream ROCm packages. This library is automatically compiled using the [build-rccl](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/.github/workflows/build-rccl.yml) GitHub Action in this repository, which generates the artifact that is then bundled into the Docker container.\n\nTo install the toolbox on **both nodes**, run:\n\n```\n./refresh_toolbox.sh\n```\n\n**What this does:**\n\n- Pulls the latest `kyuz0/vllm-therock-gfx1151` image.\n- Detects if `/dev/infiniband` exists on your host.\n- Creates the toolbox with flags to expose:\n  - **iGPU Access**: `/dev/dri`, `/dev/kfd` (Required for ROCm)\n  - **RDMA Access**: `/dev/infiniband`, `--group-add rdma`\n  - **Memory Pinning**: `--ulimit memlock=-1` (Required for DMA)\n\nBefore proceeding to run the cluster, verify that RDMA is active and providing low latency (~5µs vs ~70µs for Ethernet).\n\nRun the provided verification script from the **Head Node**:\n\n```\n# Inside toolbox\n/opt/compare_eth_vs_rdma.sh\n```\n\n**Expected Results:**\n\n```\nPath                 Latency      Bandwidth   \n------------------------------------------------\nEthernet (1G LAN)    0.074 ms     0.94 Gbps   \nEthernet (RoCE NIC)  0.068 ms     55.70 Gbps  \nRDMA (RoCE)          5.23 us      50.64 Gbps\n```\n\n*Note the massive latency drop for RDMA: from roughly 70µs over TCP/IP on the same NIC to about 5µs.*\n\n## 6. Running the Cluster\n\nA TUI utility, `start-vllm-cluster`, is provided to manage the Ray cluster and vLLM.\n\n**Enter the toolbox**:\n\n```\ntoolbox enter vllm\n```\n\n**Run the Cluster Manager**:\n\n```\nstart-vllm-cluster\n```\n\n**Configure IPs** (Option 1):\n\n- Ensure Head is `192.168.100.1` and Worker 
is `192.168.100.2`.\n\n**Start Ray Cluster** (Option 2):\n\n- **On Node 1**: Select **\"Head\"** when prompted.\n- **On Node 2**: Select **\"Worker\"** when prompted.\n- The script effectively runs:\n\n```\n# Head\nexport NCCL_SOCKET_IFNAME=<rdma_iface>\nray start --head --node-ip-address=192.168.100.1 ...\n\n# Worker\nray start --address=192.168.100.1:6379 ...\n```\n\n**Check Status** (Option 3):\n\n- Ensure you see **2 nodes** and adequate GPU resources (e.g., `2.0 GPU`).\n\nOnce the cluster is active (checked via Option 3):\n\n- Select **\"4. Launch VLLM Serve\"** in the TUI.\n- Choose a model (e.g., `Meta-Llama-3.1-8B-Instruct`).\n- **Configuration Menu**:\n  - **Tensor Parallelism**: Set to `2` (one GPU per node).\n  - **Context Length**: Auto or custom (e.g., `131072`).\n  - **Erase vLLM Cache**: Select `YES` if you are restarting after a crash.\n  - **Force Eager Mode**: Select `YES`.\n    - *Why?* CUDA Graphs can be unstable on distributed APU clusters and cause deadlocks. Eager mode is safer, but you might be able to squeeze 1-3% more performance if you take a chance and disable it.\n- **Launch**: Select \"LAUNCH SERVER\".\n\n**Important Gotchas:**\n\n- **First Run Download**: When running a model for the first time, each node in the cluster must download the weights independently. This may take some time depending on your internet connection.\n- **Gated Models (e.g., Gemma)**:\n  - Models like `google/gemma-2-27b-it` are \"gated\" and require you to request access on Hugging Face. 
  - You must export your Hugging Face token before running the cluster script:\n\n```\nexport HF_TOKEN=your_token_here\nstart-vllm-cluster\n```\n\n  - If you don't provide a token or haven't accepted the license on Hugging Face, the download will fail.\n\n## 7. Troubleshooting\n\n- **Cause**: CUDA Graph capture can freeze on distributed APU nodes.\n- **Fix**: Enable **\"Force Eager Mode\"** in the start menu.\n\nIf you see link issues, ensure your Intel E810 firmware is up to date using the standard Intel tools.\n\n## 8. References & Acknowledgements\n\n- **Reddit - Strix Halo Batching with Tensor Parallel**: [Thread by Hungry_Elk_3276](https://www.reddit.com/r/LocalLLaMA/comments/1p8nped/strix_halo_batching_with_tensor_parallel_and/)\n  - Special thanks to user **Hungry_Elk_3276** for their initial experiments with vLLM RDMA, which highlighted the missing `gfx1151` support in upstream RCCL.\n\nIf you do not have dedicated 100GbE RDMA network cards, you can directly connect the two nodes using a high-quality **Thunderbolt 4 / USB4 cable**. This will create a `thunderbolt0` network interface.\n\nWhile it lacks the ultra-low, microsecond-level latency of RDMA, it provides significantly more bandwidth than standard 1GbE/5GbE Ethernet and is easier to configure.\n\nNote: `thunderbolt-net` relies on the standard kernel TCP/IP stack.\n\n**1. Establish Connection:**\nConnect the nodes directly using a certified Thunderbolt 4 or USB4 cable. Verify the link is active:\n\n```\nip link show thunderbolt0\n```\n\n**2. Network Configuration (Head - Node 1):**\nConfigure a persistent connection using `nmcli` with a static IP and Jumbo Frames (reduces CPU overhead).\n*Note: Jumbo Frames may be unsupported on some Thunderbolt host controllers.*\n\n```\nsudo nmcli connection add type ethernet ifname thunderbolt0 con-name thunderbolt0 ipv4.method manual ipv4.addresses 192.168.2.1/24 mtu 9000\nsudo nmcli connection up thunderbolt0\n```\n\n**3. 
Network Configuration (Worker - Node 2):**\n\n```\nsudo nmcli connection add type ethernet ifname thunderbolt0 con-name thunderbolt0 ipv4.method manual ipv4.addresses 192.168.2.2/24 mtu 9000\nsudo nmcli connection up thunderbolt0\n```\n\n**4. Firewall Rules:**\nTo ensure Ray and RCCL can communicate freely over this link:\n\n```\n# Assign the interface to the trusted zone permanently\nsudo firewall-cmd --permanent --zone=trusted --add-interface=thunderbolt0\nsudo firewall-cmd --reload\n```\n\nOur cluster scripts dynamically detect the network interface based on the provided IPs. There is no need to manually export environment variables!\n\n- Open the Toolbox: `toolbox enter vllm`\n- Launch the cluster manager: `start-vllm-cluster`\n- Select **Option 1 (Configure IPs)**.\n- Set the **Head IP** explicitly to `192.168.2.1` and the **Worker IP** to `192.168.2.2`.\n- Start the cluster normally (Option 2). The script will automatically discover and utilize `thunderbolt0` as the backend network for Ray orchestration and GPU synchronization.\n\nI have added Thunderbolt support to the `compare_eth_vs_rdma.sh` script. 
Run it from inside the toolbox to see the latency and bandwidth of your Thunderbolt link compared to your other network interfaces.\n\nYou can use the `-t` flag to ONLY benchmark the Thunderbolt connection (or `-e`, `-r`, `-i` for the others):\n\n```\n/opt/compare_eth_vs_rdma.sh -t\n```\n\n", "url": "https://wpnews.pro/news/amd-strix-halo-rdma-cluster-setup-guide", "canonical_source": "https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md", "published_at": "2026-06-28 00:46:52+00:00", "updated_at": "2026-06-28 01:04:29.774924+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-infrastructure", "ai-tools", "ai-research"], "entities": ["AMD", "Intel", "Framework Desktop", "vLLM", "Ray", "RCCL", "RoCE v2", "Fedora"], "alternates": {"html": "https://wpnews.pro/news/amd-strix-halo-rdma-cluster-setup-guide", "markdown": "https://wpnews.pro/news/amd-strix-halo-rdma-cluster-setup-guide.md", "text": "https://wpnews.pro/news/amd-strix-halo-rdma-cluster-setup-guide.txt", "jsonld": "https://wpnews.pro/news/amd-strix-halo-rdma-cluster-setup-guide.jsonld"}}