I spent the past few weeks working on this project and thought it might be interesting to write up a technical report on it: the motivation, the process, the learnings, etc.
DISCLAIMER: all of the code in this repo (github.com/hellas-ai/thunderbolt-ibverbs) is AI-generated (mostly Codex 5.5 and Opus 4.7). While I made an effort to understand enough of it to keep it on track, I almost certainly failed in many instances, and I'm sure the code contains many false assumptions, hallucinations and plain stupidity. No warranty or guarantee offered; for research use only, not for human consumption.
TL;DR. We write a Linux kernel module and userspace shim to pretend our generic USB4 connection is a low-latency, high-performance InfiniBand device, and use it to perform distributed inference across two 128GB Strix Halo mini PCs. Basic interop with Apple's native protocol is functional.
~48 Gb/s per direction (~95 Gb/s bidi total) sustained ib_write_bw, 4-HCA aggregate at 1 MiB / 8 QPs with IOMMU off, vs ~2.3 Gb/s over the onboard 2.5 GbE and ~9 Gb/s for soft-RoCE on top of thunderbolt-net at the per-rail level.
~7 µs one-way ib_write_lat at 64 B, single QP, vs ~28 µs over RXE/2.5 GbE and ~65 µs over RXE/TBnet.
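
To put the headline numbers in perspective, here is a quick back-of-the-envelope sketch in Python. It only uses the approximate figures quoted above; the dictionary keys, function names and the 1 GiB payload are hypothetical and chosen for illustration. Note that the soft-RoCE figure is quoted per rail while the headline bandwidth is a 4-HCA aggregate, so that particular ratio is probably not a like-for-like comparison.

```python
# Approximate figures copied from the TL;DR (not re-measured here).
BANDWIDTH_GBPS = {
    "tb_ibverbs_4hca": 48.0,    # per direction, 4-HCA aggregate
    "onboard_2_5gbe": 2.3,
    "rxe_over_tbnet": 9.0,      # quoted at the per-rail level
}
LATENCY_US = {
    "tb_ibverbs": 7.0,          # one-way, 64 B, single QP
    "rxe_over_2_5gbe": 28.0,
    "rxe_over_tbnet": 65.0,
}


def bandwidth_gain(ours_gbps: float, baseline_gbps: float) -> float:
    """How many times more throughput we get than the baseline."""
    return ours_gbps / baseline_gbps


def latency_gain(ours_us: float, baseline_us: float) -> float:
    """How many times lower our latency is than the baseline."""
    return baseline_us / ours_us


def transfer_seconds(num_bytes: int, gbps: float) -> float:
    """Idealised time to move num_bytes at a sustained rate of gbps (10^9 bit/s)."""
    return num_bytes * 8 / (gbps * 1e9)


if __name__ == "__main__":
    ours_bw = BANDWIDTH_GBPS["tb_ibverbs_4hca"]
    for name in ("onboard_2_5gbe", "rxe_over_tbnet"):
        gain = bandwidth_gain(ours_bw, BANDWIDTH_GBPS[name])
        print(f"bandwidth vs {name}: {gain:.1f}x")

    ours_lat = LATENCY_US["tb_ibverbs"]
    for name in ("rxe_over_2_5gbe", "rxe_over_tbnet"):
        gain = latency_gain(ours_lat, LATENCY_US[name])
        print(f"latency vs {name}: {gain:.1f}x lower")

    one_gib = 2**30  # hypothetical payload size
    for name, gbps in BANDWIDTH_GBPS.items():
        print(f"1 GiB over {name}: {transfer_seconds(one_gib, gbps):.2f} s")
```

The transfer-time estimate ignores protocol overhead and assumes the quoted rate is sustained for the whole transfer, so treat it as a rough lower bound rather than a prediction.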