Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

wpnews.pro


Source: arxiv.org · Published: 2026-06-12T04:00Z · Topic: ai-chips


Researchers at an undisclosed institution have reverse-engineered Apple's Metal 4.1 tensor compute path on the M4 Max GPU and found that the fp8 matmul2d operation is emulated rather than hardware-accelerated. According to their arXiv preprint, fp8 sustains only 0.94x the throughput of fp16 despite reading half the operand bytes, which makes it a memory-footprint feature rather than a performance feature on this hardware. The study also found that matmul2d executes entirely on the GPU shader cores, with no dedicated matrix datapath. Using these results, the researchers built a hand-fused GEMM + bias + GELU kernel that outperformed the decomposed path by 6.5% to 12.9% in the cache-resident regime.
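To make the headline ratio concrete, here is a back-of-the-envelope sketch in Python. It is not code from the preprint: the 1024 x 1024 x 1024 problem size and the helper names are assumptions chosen for illustration, and only the 0.94x figure and the halved operand bytes come from the article.

```python
def gemm_operand_bytes(m: int, n: int, k: int, bytes_per_element: int) -> int:
    """Bytes read for the two input matrices A (m x k) and B (k x n)."""
    return (m * k + k * n) * bytes_per_element


def gemm_flops(m: int, n: int, k: int) -> int:
    """One multiply and one add per (m, n, k) triple."""
    return 2 * m * n * k


# Hypothetical problem size, not taken from the preprint.
M = N = K = 1024

fp16_bytes = gemm_operand_bytes(M, N, K, bytes_per_element=2)
fp8_bytes = gemm_operand_bytes(M, N, K, bytes_per_element=1)

# Reported in the article: fp8 throughput relative to fp16.
REPORTED_FP8_RELATIVE_THROUGHPUT = 0.94

byte_ratio = fp8_bytes / fp16_bytes                  # operand traffic, fp8 vs fp16
time_ratio = 1.0 / REPORTED_FP8_RELATIVE_THROUGHPUT  # run time, fp8 vs fp16
flops_per_byte_fp16 = gemm_flops(M, N, K) / fp16_bytes
flops_per_byte_fp8 = gemm_flops(M, N, K) / fp8_bytes

print(f"operand bytes, fp8 / fp16: {byte_ratio:.2f}")
print(f"run time, fp8 / fp16:      {time_ratio:.3f}")
print(f"FLOPs per operand byte:    fp16={flops_per_byte_fp16:.0f}, fp8={flops_per_byte_fp8:.0f}")
```

If the kernel were limited by operand traffic, halving the bytes could in principle allow up to about a 2x speedup. A ratio below 1.0 instead means the same work takes roughly 6% longer in fp8, which is consistent with the article's reading that fp8 saves memory but not time on this GPU.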


arXiv:2606.12765v1. Abstract: Apple's Metal 4.1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose interface is documented but whose hardware behavior is deliberately hidden. The specification states which data-type rows are supported, never whether they are hardware-accelerated, where the operation physically executes, what its accumulator width is, or how it partitions matrix fragments across threads. We present Rigel, an empirical characterization of this path on a single Apple M4 Max (a pre-neural-accelerator generation). Using a checksum-gated, provenance-tracked microbenchmark harness, Rigel recovers eleven facts the v4.1 specification hides or contradicts. The headline finding: the Metal 4.1 fp8 (E4M3) matmul2d is emulated, not accelerated: it sustains 0.94x the throughput of fp16 despite reading half the operand bytes, so on M4 it is a memory-footprint feature, not a performance feature. We further show, via a three-signal triangulation (throughput ceiling, comparison against simdgroup_matrix, and per-rail power attribution), that matmul2d executes entirely on the GPU shader cores with no dedicated matrix datapath and no evidence of Apple Neural Engine routing; that it accumulates in >=fp32; and we reconstruct the opaque 8x8 cooperative_tensor fragment layout Apple documents nowhere. Acting on the characterization, a hand-fused GEMM + bias + GELU kernel beats the decomposed path by +6.5-12.9% in the cache-resident regime. All findings are reproducible from committed MIT-licensed code and per-cell CSVs.
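The abstract does not say how the accumulator width was measured. One common way to probe it, shown here as a CPU-side sketch in Python and not as Rigel's actual method, is to sum enough ones that a narrow accumulator stalls. The helper names and the choice of K = 4096 are assumptions; the model also assumes strictly sequential accumulation, which real hardware may not use.

```python
import struct


def round_to(fmt: str, x: float) -> float:
    """Round a Python float to a narrower IEEE format ('<e' = fp16, '<f' = fp32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def dot_of_ones(k: int, fmt: str) -> float:
    """Dot product of two all-ones vectors of length k.

    The running sum is rounded to the given format after every add,
    modelling an accumulator of that width.
    """
    acc = 0.0
    for _ in range(k):
        acc = round_to(fmt, acc + 1.0 * 1.0)
    return acc


K = 4096  # hypothetical reduction length, chosen to exceed fp16's exact-integer range

fp16_sum = dot_of_ones(K, "<e")  # stalls once acc + 1 is no longer representable
fp32_sum = dot_of_ones(K, "<f")  # stays exact at this length

print(fp16_sum, fp32_sum)  # prints: 2048.0 4096.0
```

In this model an fp16 accumulator stops growing at 2048, because 2049 is not representable in fp16 and the sum rounds back down, while an fp32 accumulator returns the exact 4096. A GPU kernel that returns 4096 for the same inputs would therefore be consistent with accumulation wider than fp16, although telling fp32 apart from something wider would need a much longer reduction.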

source & further reading


Read original on arxiv.org → arxiv.org/abs/2606.12765




