MLX + JACCL: Distributed AI Training Over Thunderbolt 5

Apple released JACCL, an open-source collective communication library for distributed AI training over Thunderbolt 5, at WWDC 2026. The library enables clustering 2–4 Macs with 50–60 Gbps bandwidth and single-digit microsecond latency, allowing distributed inference and fine-tuning without expensive cloud GPUs. JACCL supports tensor and data parallelism, auto-selects mesh or ring topology, and requires macOS 26.2, Thunderbolt 5 ports, and fully connected mesh cabling.

Apple shipped the missing piece for Mac-based ML research at WWDC 2026: JACCL, an open-source collective communication library that turns Thunderbolt 5 cables into a high-speed GPU interconnect. Starting with macOS 26.2, you can cluster 2–4 Macs, hit 50–60 Gbps with single-digit microsecond latency, and run distributed inference or fine-tuning that previously required expensive cloud GPUs. The name is a deliberate jab at Nvidia’s NCCL, and the performance backs it up.

What Is JACCL?

JACCL stands for Jack and Angelos’ Collective Communication Library. It is pronounced “Jackal,” and yes, the pun on Nvidia’s NCCL (https://developer.nvidia.com/nccl) is intentional. It’s named after Jack Beasley, who led development of RDMA over Thunderbolt at Apple, and Angelos Katharopoulos from the MLX team. The library is open-source and ships as part of the MLX ecosystem.

The key differentiator from older Mac clustering approaches, like Exo over TCP, is latency. Previous ring-based backends carry roughly 300 microseconds of latency per communication operation, tolerable for inference pipelines but painful for tight synchronization during training. JACCL uses RDMA over Thunderbolt 5, dropping that to single-digit microseconds: an improvement of more than an order of magnitude that makes gradient synchronization viable at training scale.

Zach Mueller of Hugging Face put it plainly after the announcement: “Genuinely happy that apple and co have solved the NCCL solution with RDMA here.”

Two Parallelism Modes

JACCL supports both main approaches to distributed ML, and choosing the right one matters:

Tensor parallelism shards model weights across nodes. Each node holds all layers, but weight tensors are split. Use this when your model is too large to fit in a single Mac’s RAM: 70B+ models that can’t fit on one M3 Ultra without heavy quantization. Inference speeds up as you add nodes, though not quite linearly (Apple’s benchmarks show a 3x speedup on four nodes).

Data parallelism runs a full model copy on each node and trains on different data batches simultaneously.
Gradients are averaged across nodes after each batch. Use this for fine-tuning 7B–30B models faster. Apple demonstrated 3x training throughput fine-tuning Qwen 3.5 9B across four M3 Ultras using this approach.

JACCL auto-selects between mesh topology (low latency, best for small messages and tight synchronization) and ring topology (higher bandwidth, better for shuffling large model weights). Override with --backend jaccl for mesh or --backend jaccl-ring for ring.

Hardware Requirements

Before touching a config file, confirm you have the right hardware:

- macOS 26.2: RDMA over Thunderbolt is not available on earlier releases
- Thunderbolt 5 ports: Thunderbolt 4 doesn’t support RDMA; this means M3 Ultra, M4 Pro, M4 Max, or newer Macs with Thunderbolt 5
- Thunderbolt 5 cables: active cables recommended for runs over 0.8 m
- Fully connected mesh: JACCL requires a direct cable between every pair of nodes (2 nodes = 1 cable, 3 nodes = 3 cables, 4 nodes = 6 cables)

The fully connected mesh constraint is the most frustrating limitation. Four nodes already create a cable management problem, and Apple hasn’t added switch support yet. Plan your physical setup before buying cables.

Setting Up a JACCL Cluster

The setup has four steps. Step 1 is the only one that requires a reboot.

Step 1: Enable RDMA on each Mac. Boot each machine into macOS Recovery (hold the power button on Apple Silicon), open Terminal via Utilities, and run:

    rdma_ctl enable

Reboot normally, then verify RDMA is active:

    ibv_devices

Your Thunderbolt interfaces should appear in the output. An empty response means RDMA didn’t initialize; check your macOS version first.

Step 2: Generate the cluster hostfile. Run mlx.distributed_config from your primary node.
It probes SSH connectivity to all nodes and maps the Thunderbolt topology automatically:

    mlx.distributed_config --verbose --backend jaccl \
        --hosts m3-ultra-1,m3-ultra-2 \
        --over thunderbolt \
        --auto-setup \
        --output jaccl-cluster.json

Step 3: Launch your distributed job. Use mlx.launch with the generated hostfile:

    mlx.launch --verbose --backend jaccl \
        --hostfile jaccl-cluster.json \
        --env MLX_METAL_FAST_SYNCH=1 \
        -- python -m mlx_lm.generate \
        --model mlx-community/Qwen3-30B-A3B-4bit

Set MLX_METAL_FAST_SYNCH=1. It enables faster GPU-CPU synchronization and makes a measurable difference on latency-sensitive operations.

Step 4: Verify the cluster. The --verbose flag on both commands outputs which RDMA devices were detected and which nodes connected successfully. If a node fails to join, MLX reports it before your job starts rather than failing mid-run.

Performance and Honest Trade-offs

Apple’s benchmarks show 3x inference speedup and 3x training throughput on 4-node clusters. Community RDMA file transfer tests have documented 3.5+ GB/s throughput (https://github.com/ml-explore/mlx/issues/3207), consistent with Thunderbolt 5’s ceiling. Latency drops from 300 µs (TCP ring) to single-digit microseconds with RDMA, the kind of number that makes gradient synchronization viable at training scale.

That said, this is not a cloud GPU replacement for production-scale workloads. A single H100 still outperforms a 4-node Mac cluster on 70B+ training throughput, and at $2.50–$3/hour on-demand you’re not waiting for hardware to amortize. The Mac cluster wins on data sovereignty, zero egress costs, power efficiency, and fixed-cost 24/7 availability. For ML researchers running iterative experiments on 7B–30B models, that’s a defensible trade-off.

One honest caveat: JACCL is still rough in places. Active bugs around RDMA initialization are tracked on GitHub issue 2944 (https://github.com/ml-explore/mlx/issues/2944), and macOS 26.2 is currently in developer beta.
Don’t plan a production deployment on this today.

Getting Started

Apple’s official WWDC26 session “Explore distributed inference and training with MLX” (https://developer.apple.com/videos/play/wwdc2026/233/) is the right starting point: it walks through the complete setup process with live demos and covers tensor vs. data parallelism trade-offs in detail. The MLX distributed communication documentation (https://ml-explore.github.io/mlx/build/html/usage/distributed.html) covers the Python API. If you want a working cluster config, the community repo alexziskind1/mlx-jaccl-cluster (https://github.com/alexziskind1/mlx-jaccl-cluster) has setup scripts and a tested hostfile template.

The ceiling on Mac-based ML research just went significantly higher. A couple of Thunderbolt 5 cables and an afternoon of setup is now the gap between single-node inference and a distributed training cluster. For the right workloads, that’s not a small change.
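
If you are planning the cabling before you start, the full-mesh rule from the hardware requirements (one direct cable between every pair of nodes) works out to n(n-1)/2 cables for n nodes. Here is a minimal Python sketch of that arithmetic. The helper name mesh_cables and the third and fourth hostnames are hypothetical, used only for illustration; nothing here is part of MLX or JACCL.

```python
from itertools import combinations


def mesh_cables(hosts):
    """Return every host pair that needs a direct cable in a full mesh.

    Hypothetical planning helper; not part of MLX or JACCL.
    """
    # A fully connected mesh needs one cable per unordered pair of nodes,
    # which gives n * (n - 1) / 2 cables for n nodes.
    return list(combinations(hosts, 2))


if __name__ == "__main__":
    # Hostnames are illustrative placeholders.
    hosts = ["m3-ultra-1", "m3-ultra-2", "m3-ultra-3", "m3-ultra-4"]
    for n in range(2, len(hosts) + 1):
        cables = mesh_cables(hosts[:n])
        print(f"{n} nodes need {len(cables)} cable(s)")
    # List the individual links for the largest cluster.
    for a, b in mesh_cables(hosts):
        print(f"  {a} <-> {b}")
```

Running it for two, three, and four hosts reproduces the 1, 3, and 6 cable counts from the hardware requirements, and the printed pairs double as a checklist when you plug everything in.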