{"slug": "wave-a-universal-gpu-instruction-set-architecture", "title": "Wave – A universal GPU instruction set architecture", "summary": "A new open-source project called WAVE has introduced a vendor-neutral GPU instruction set architecture that allows developers to write GPU code once and run identical binaries on NVIDIA, AMD, Apple, and Intel hardware. The system, built on 11 hardware-invariant primitives across 34,000 lines of Rust code, has been verified on Apple M4 Pro, NVIDIA T4, and AMD MI300X GPUs, achieving up to 3,587 GFLOPS on matrix multiplication. WAVE aims to standardize GPU computing in the same way ARM standardized CPUs, enabling cross-vendor compatibility without code changes.", "body_md": "**The ARM of GPU computing. One binary, any GPU.**\n\nWAVE is a vendor-neutral GPU instruction set architecture. Write GPU code once, run it on NVIDIA, AMD, Apple, and Intel GPUs unchanged. The same binary produces identical results on all four vendors. ARM defines what a CPU is so multiple vendors can build compatible chips. 
WAVE does the same for GPU computation.\n\n- **11** hardware-invariant primitives across 4 GPU vendors\n- **34,000** lines of Rust across 10 crates\n- **618+** unit tests, **102/102** conformance tests passing\n- **Verified** on Apple M4 Pro, NVIDIA T4, and AMD MI300X\n- **3,587 GFLOPS** F32 matrix multiply on M4 Pro (53.5% of Apple MPS)\n- **89.29%** CIFAR-10 accuracy via PyTorch integration, matching native exactly\n\n```\npip install wave-gpu\n```\n\nOr build from source:\n\n```\ngit clone https://github.com/Oabraham1/wave.git\ncd wave\nfor crate in wave-decode wave-asm wave-dis wave-emu wave-compiler wave-metal wave-ptx wave-hip wave-sycl wave-runtime; do\n  (cd $crate && cargo build --release)\ndone\n```\n\n```python\nimport wave_gpu as wg\n\ndevice = wg.device()\nprint(f\"Running on: {device}\")\n\na = wg.array([1.0, 2.0, 3.0, 4.0])\nb = wg.array([5.0, 6.0, 7.0, 8.0])\nout = wg.zeros(4)\n\nprint(f\"a: {a}\")\nprint(f\"b: {b}\")\n```\n\n```\nSource Code (Python / Rust / C++ / TypeScript)\n  |\n  v\nwave-compiler ──> WAVE Binary (.wbin) ──> wave-emu (reference emulator)\n                        |\n           ┌────────────┼────────────┐\n           v            v            v\n       wave-metal   wave-ptx    wave-hip    wave-sycl\n       (Apple MSL)  (NVIDIA)    (AMD ROCm)  (Intel oneAPI)\n           |            |            |            |\n           v            v            v            v\n        Apple GPU    NVIDIA GPU   AMD GPU    Intel GPU\n```\n\n| Crate | Purpose |\n|---|---|\n| `wave-decode` | Shared instruction decoder and binary format |\n| `wave-asm` | Assembler (.wave text to .wbin binary) |\n| `wave-dis` | Disassembler (.wbin binary to .wave text) |\n| `wave-emu` | Reference emulator |\n| `wave-compiler` | Multi-language compiler (Python/Rust/C++/TS to .wbin) |\n| `wave-metal` | Apple Metal backend |\n| `wave-ptx` | NVIDIA PTX backend |\n| `wave-hip` | AMD HIP backend |\n| `wave-sycl` | Intel SYCL backend |\n| `wave-runtime` | SDK runtime with in-process compilation and kernel cache |\n| `sdk/python` | Python SDK (`pip install wave-gpu`) |\n\nEach crate builds independently. No Cargo workspace.\n\nAuto-tuned results on Apple M4 Pro at 4096x4096 matrix size (MPS baseline: 6,710 GFLOPS):\n\n| Kernel | F32 GFLOPS | F16 GFLOPS | F32 % of MPS |\n|---|---|---|---|\n| Blocked GEMM | 3,587 | 4,049 | 53.5% |\n| Fused GEMM+bias+ReLU | 3,562 | 4,027 | 53.1% |\n| Fused GEMM+bias+GELU | 3,514 | -- | 52.4% |\n\nCross-vendor hardware verification:\n\n| Vendor | GPU | Status |\n|---|---|---|\n| Apple | M4 Pro | Verified |\n| NVIDIA | T4 | Verified |\n| AMD | MI300X | Verified |\n| Intel | Arc | Pending |\n\n[Toward a Universal GPU Instruction Set Architecture: A Cross-Vendor Analysis of Hardware-Invariant Computational Primitives in Parallel Processors](https://doi.org/10.5281/zenodo.19163452) (Zenodo, 2026)\n\n- arXiv preprint: [2603.28793](https://arxiv.org/abs/2603.28793)\n- Under review: International Journal of Parallel Programming (IJPP), April 2026\n- Venue targets: ASPLOS 2027, CGO 2026, MLSys, CAV\n\nSee [CONTRIBUTING.md](/Oabraham1/wave/blob/main/CONTRIBUTING.md) for the fork-based workflow, code standards, and testing requirements.\n\nApache License, Version 2.0. 
See [LICENSE](/Oabraham1/wave/blob/main/LICENSE) for terms.\n\nAsahi Linux reverse engineering team, Dougall Johnson (GPU microarchitecture documentation), AMD GPUOpen, Google Colab (NVIDIA T4 verification), DigitalOcean (AMD MI300X verification).", "url": "https://wpnews.pro/news/wave-a-universal-gpu-instruction-set-architecture", "canonical_source": "https://github.com/Oabraham1/wave", "published_at": "2026-05-26 13:56:41+00:00", "updated_at": "2026-05-26 14:10:51.179937+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-chips", "ai-tools", "ai-research", "machine-learning"], "entities": ["WAVE", "NVIDIA", "AMD", "Apple", "Intel", "ARM", "PyTorch", "Apple M4 Pro"], "alternates": {"html": "https://wpnews.pro/news/wave-a-universal-gpu-instruction-set-architecture", "markdown": "https://wpnews.pro/news/wave-a-universal-gpu-instruction-set-architecture.md", "text": "https://wpnews.pro/news/wave-a-universal-gpu-instruction-set-architecture.txt", "jsonld": "https://wpnews.pro/news/wave-a-universal-gpu-instruction-set-architecture.jsonld"}}