Show HN: Alloy – a PyTorch backend and inference engine for Apple Silicon

Alloy, a new PyTorch backend and inference engine for Apple Silicon, has been released as a technical preview. The open-source project compiles Python GPU kernels to Metal and supports LLM serving with a drop-in compatible API for OpenAI, Anthropic, and Ollama clients. Alloy includes a torch.compile backend and features like warm-prefix KV reuse, on-GPU sampling, and speculative decoding.

Kernel authoring DSL, torch.compile backend and LLM serving for Apple Silicon. Alloy is a compiler and runtime for GPU compute kernels on Apple Silicon. You write kernels in Python. Alloy compiles them to Metal through a tile IR pipeline; covering everything from per-thread scalar kernels to cooperative tiled GEMM with simdgroup MMA and automatic operator fusion for multi-kernel pipelines. Status : technical preview. Requires Apple Silicon M1+ and macOS 13+. The Python packages need Python 3.10–3.12. Install install Inference server - Quickstart inference-server---quickstart torch.compile backend torchcompile-backend Benchmarks benchmarks Writing kernels writing-kernels Why Alloy why-alloy Contributing contributing License license Python pip / uv pip install 'alloy-kit serve ' local LLM server + CLI + torch.compile backend pip install alloy-kit lean: just the GPU kernel compiler no torch pip install 'alloy-kit all ' + training / vision / audio research extras import alloy as al The PyPI distribution is alloy-kit . The brackets are optional dependency groups: the lean base provides @al.kernel with the tile IR, MSL emitter and Metal dispatch machinery, and serve adds everything needed to run the server and the alloy CLI. Standalone no Python required : curl -fsSL https://raw.githubusercontent.com/rayanht/alloy/main/installer/install.sh | sh Installs a self-contained alloy CLI into /usr/local . From source contributors : see Contributing contributing . Alloy serves a loopback HTTP API that's drop-in compatible with the OpenAI, Anthropic and Ollama clients. Important Run alloy tune <model before serving for optimal performance Start the server in the foreground; loads the model from a local Ollama cache or Hugging Face if present. alloy serve -m qwen3:0.6b Ollama tag alloy serve -m bartowski/Llama-3.2-3B-Instruct-GGUF:Q4 K M HF model OpenAI: curl http://127.0.0.1:11434/v1/chat/completions \ -H 'content-type: application/json' \ -d '{"model":"qwen3:0.6b","messages": {"role":"user","content":"hi"} }' Ollama: curl http://127.0.0.1:11434/api/chat \ -d '{"model":"qwen3:0.6b","messages": {"role":"user","content":"hi"} }' Claude Code alloy launch claude The default port is 11434 . Pass --port 11435 to alloy serve or set ALLOY PORT to override. | Feature | Status | |---|---| | Warm-prefix KV reuse bookmarks + branching | Stable | | On-GPU sampling temp / top-p / top-k / min-p / seed | Stable | | Constrained decoding xgrammar JSON + tool grammars | Stable | | Tool calling OpenAI / Anthropic / Ollama, per-family parsers | Stable | | Reasoning / thinking split | Stable | | MoE inference | Stable Qwen3.5-MoE | | Vision input | Stable gemma4 | | Audio input | Stable gemma4 | | Embeddings | Stable nomic-embed-text | | Speculative decoding — PLD prompt lookup | Opt-in --spec pld | | Speculative decoding — MTP | Opt-in --spec mtp , Qwen3.5 | | Speculative decoding — DFlash block diffusion | Opt-in --spec dflash | | Paged KV cache | Opt-in ALLOY KV=paged | | KV cache quantization int8 + fp16 scales | Opt-in --kv-quant q8 0 | Supported quantizations Model weights | source | format | supported | |---|---|---| | GGUF | Q4 K Q4 K M / Q4 K S | ✅ | | GGUF | Q5 0 | ✅ | | GGUF | Q6 K | ✅ | | GGUF | Q8 0 | ✅ | | GGUF | F16 / BF16 / F32 | ✅ | | GGUF | Q2 K / Q3 K / Q5 K | ❌ | | GGUF | Q4 0 / Q4 1 / Q5 1 | ❌ | | GGUF | IQ1 / IQ2 / IQ3 / IQ4 IQ4 XS, IQ4 NL | ❌ | | MLX | 4-bit affine group size 64 / 128 | ✅ | | MLX | 2-bit / 3-bit / 6-bit / 8-bit | ❌ | KV cache | format | supported | |---|---| | fp16 default | ✅ | | q8 0 | ✅ | | q4 / other | ❌ | Alloy includes a torch.compile backend that compiles covered PyTorch FX graphs to fused Metal compute kernels. python import torch import transformers import alloy torch registers the "alloy" backend model = transformers.AutoModelForCausalLM.from pretrained "gpt2" .eval compiled = torch.compile model, backend="alloy" input ids = torch.randint 0, model.config.vocab size, 1, 16 output = compiled input ids=input ids The backend handles: FX graph decomposition, operator fusion RMSNorm, RoPE, GELU, batched QKV, GEMM+LayerNorm, scalar broadcast , GQA-native attention, compiled dispatch plans, and tuning. Runnable model examples live in examples/torch/ /rayanht/alloy/blob/main/examples/torch : — multi-layer perceptron Linear / LayerNorm / GELU mlp.py — GroupNorm ResNet Conv2d + residual blocks resnet.py — pre-norm encoder block SDPA + GELU MLP transformer.py A full torch.compile training step forward, backward, and the optimizer update runs end to end through Alloy and matches PyTorch eager within floating-point tolerance for dense transformer-style models: embeddings, linear layers, normalization, residual blocks, attention, cross-entropy, and the common optimizers SGD, Adam, AdamW, RMSprop . A small language model trains end to end, and LoRA fine-tuning of a pretrained transformer works in model.train . Enable it before torch.compile : python import torch import torch.nn as nn import torch.nn.functional as F import alloy torch registers the "alloy" backend from alloy torch.training import set training mode set training mode True before torch.compile model = nn.Sequential nn.Linear 64, 128 , nn.GELU , nn.Linear 128, 1 step = torch.compile model, backend="alloy" opt = torch.optim.AdamW model.parameters , lr=0.05 x, y = torch.randn 32, 64 , torch.randn 32, 1 for in range 20 : opt.zero grad loss = F.mse loss step x , y loss.backward opt.step Fine-tuning a pretrained transformer with PEFT https://github.com/huggingface/peft LoRA is the same shape: python import peft import transformers model = peft.get peft model transformers.AutoModelForCausalLM.from pretrained "gpt2" , peft.LoraConfig target modules= "c attn" , task type="CAUSAL LM" , step = torch.compile model, backend="alloy" opt = torch.optim.AdamW p for p in model.parameters if p.requires grad , lr=5e-3 model.train for input ids in batches: opt.zero grad loss = step input ids=input ids, labels=input ids .loss loss.backward opt.step Runnable training examples live in examples/torch/ /rayanht/alloy/blob/main/examples/torch : — MLP regression Linear / LayerNorm / GELU, AdamW train mlp.py — transformer block + cross-entropy SGD train transformer.py — tiny language model Embedding + attention + cross-entropy train lm.py — LoRA fine-tuning of gpt2 PEFT + transformers finetune lora.py It is still a preview. The backward pass does not yet cover convolutions or pooling, so CNN training is not supported. Inference is the primary, fully validated path. Reproduce with alloy bench <HF OR OLLAMA TAG | model | quant | pp512 | tg128 | |---|---|---|---| | LFM2.5-1.2B-Instruct-GGUF | Q4 K M | 4222 | 508 | | bartowski/Llama-3.2-3B-Instruct-GGUF | Q4 K M | 2061 | 198 | | Qwen Qwen3-0.6B-GGUF | Q4 K M | 8311 | 612 | | model | quant | pp512 | tg128 | |---|---|---|---| | qwen2.5:0.5b | Q4 K M | 12102 | 505 | | qwen3:0.6b | Q4 K M | 10077 | 584 | | llama3.2:1b | Q8 0 | 5653 | 324 | | qwen3.5:0.8b | Q8 0 | 6141 | 349 | | deepseek-r1:1.5b | Q4 K M | 3295 | 274 | | qwen2.5:1.5b | Q4 K M | 3295 | 270 | | qwen3.5:2b | Q8 0 | 3247 | 187 | | gemma4:e2b | Q4 K M | 2121 | 175 | | qwen2.5:3b | Q4 K M | 1617 | 185 | | gemma4:e4b | Q4 K M | 1079 | 115 | | qwen3.5:4b | Q4 K M | 1098 | 122 | | qwen3.5:9b | Q4 K M | 598 | 78.6 | | qwen3.6:35b | Q4 K M | 988 | 121 | | model | quant | pp512 | tg128 | |---|---|---|---| | Qwen/Qwen3-0.6B-MLX-4bit | 4-bit g128 | 10063 | 710 | | LiquidAI/LFM2.5-1.2B-Instruct-MLX-4bit | 4-bit g64 | 5688 | 589 | | mlx-community/Llama-3.2-3B-Instruct-4bit | 4-bit g64 | 2173 | 220 | | mlx-community/Qwen3-4B-4bit | 4-bit g64 | 1673 | 174 | | mlx-community/Qwen3-8B-4bit | 4-bit g64 | 866 | 102 | | model | vision ms | alloy TTFT | alloy dec | alloy wall | |---|---|---|---|---| | gemma4:e2b | 229 | 455 | 172 | 1193 | | gemma4:e4b | 257 | 665 | 99.0 | 1949 | Per-regime encoder tok/s from alloy bench nomic-embed-text --dataset embeddings . | regime | batch | seq | tok/s | |---|---|---|---| | q short | 1 | 10 | 5094 | | q long | 1 | 256 | 19161 | | b8 short | 8 | 10 | 14142 | | b8 long | 8 | 128 | 11840 | python import numpy as np import alloy as al @al.kernel def blur src, dst: al.output, W: al.constexpr, H: al.constexpr : x = al.program id 0 y = al.program id 1 acc = 0.0 count = 0 for dy in range -1, 2 : for dx in range -1, 2 : nx = x + dx ny = y + dy if nx = 0: if nx < W: if ny = 0: if ny < H: acc = acc + al.load src + ny W + nx count = count + 1 al.store dst + y W + x, acc / count W, H = 1920, 1080 img = np.random.rand H, W .astype np.float32 out = blur W, H img.ravel , W=W, H=H print np.asarray out .reshape H, W NumPy and PyTorch arrays can be bound directly as inputs for covered contiguous host-memory paths. The kernel's al.output is allocated automatically and returned as an AlloyBuffer — convert with np.asarray ... or .numpy . Some interop paths allocate Alloy-owned shared buffers or require layout copies, so this is not a blanket promise that every input type and view is no-copy. More runnable examples live in examples/kernel/ /rayanht/alloy/blob/main/examples/kernel : — masked elementwise add vector add.py — fused GELU / sigmoid / SiLU elementwise.py — row-wise softmax, manual vs. builtin softmax.py — 2D box blur shown above blur.py — naive vs. tiled GEMM with simdgroup MMA matmul.py — atomics histogram.py — N-body simulation nbody.py — divergent per-thread iteration mandelbrot.py — online-softmax attention flash attention.py python @al.kernel def matmul A, B T, C: al.output, BLOCK M: al.constexpr = 64, BLOCK N: al.constexpr = 64, BLOCK K: al.constexpr = 16 : M, K = A.shape N = B T.shape 0 pm = al.program id 0 pn = al.program id 1 rm = pm BLOCK M + al.arange 0, BLOCK M rn = pn BLOCK N + al.arange 0, BLOCK N rk = al.arange 0, BLOCK K a ptrs = A + rm :, None K + rk None, : b ptrs = B T + rn :, None K + rk None, : acc = al.zeros BLOCK M, BLOCK N , dtype=al.float32 for k in range 0, K, BLOCK K : a = al.load a ptrs, mask= rm :, None < M & rk None, : < K b = al.load b ptrs, mask= rn :, None < N & rk None, : < K acc += al.tile dot a, b, transpose rhs=True a ptrs += BLOCK K b ptrs += BLOCK K al.store C + rm :, None N + rn None, : , acc, mask= rm :, None < M & rn None, : < N This compiles to Metal with simdgroup matrix multiply-accumulate MMA , cooperative tile loads, threadgroup shared memory staging, and optional double buffering all generated automatically from the tile IR. High-performance implementations of common operations. C = al.dot transpose rhs A, B tiled GEMM with autotuning s = al.softmax x fused row-wise softmax y = al.layernorm x, gamma, beta fused layer normalization y, = al.rms norm x, weight fused RMS normalization + per-row 1/rms L = al.cross entropy logits, labels fused cross-entropy loss kernel Builtins infer output shapes and constexpr values from input arrays. They compose with fusion. e.g. al.dot followed by an elementwise kernel automatically fuses the elementwise op as an epilogue. Grid and thread indexing pid = al.program id 0 threadgroup position block index tid = al.thread id thread position within threadgroup offs = pid 1024 + al.arange 0, 1024 block-level offsets Memory x = al.load ptr + offs, mask=mask masked global load al.store ptr + offs, val, mask=mask masked global store buf = al.shared 256 threadgroup shared memory loc = al.local 8 per-thread register array al.barrier threadgroup memory barrier al.coop load buf, src ptr, size cooperative threadgroup load + barrier al.copy4 dst, offset, src ptr vectorized 4-element load Tile operations 2D blocks for GEMM, attention, etc. acc = al.zeros BLOCK M, BLOCK N , dtype=al.float32 acc += al.tile dot a, b, transpose rhs=True simdgroup MMA reduced = al.simd reduce val cross-lane reduction Simdgroup warp-level al.simd shuffle xor val, offset butterfly shuffle al.simd shuffle val, lane read from specific lane acc = al.simd matrix 8x8 matrix accumulator al.simd load src, offset, stride load into simd matrix al.simd mma acc, a, b matrix multiply-accumulate Atomics al.atomic add ptr, idx, val atomic fetch-and-add int32 al.atomic max ptr, idx, val al.atomic cas ptr, idx, expected, desired compare-and-swap Control flow — plain Python if cond: ... for i in range N : ... while cond: ... These three kernels fuse into one — no intermediate buffers allocated. Each call returns a lazy AlloyBuffer; feed it straight into the next: t1 = scale grid x, N=N t1 = x 2.0 t2 = bias grid t1, N=N t2 = t1 + 1.0 result = activate grid t2, N=N result = relu t2 Reading the result triggers one fused GPU submission: print result 0 Pass PyTorch tensors or MLX arrays directly when their storage layout is supported: python import torch x = torch.randn 32, 128 CPU tensor, lives in unified memory result = my kernel grid x, M=32, N=128 x bound directly; result returned as an AlloyBuffer Alloy's compiled plans may convert PyTorch input storage to Alloy-owned shared memory on first execution so subsequent dispatches can resolve Metal buffers by handle. That keeps subsequent dispatches free of per-call input copies for stable storage. al.inspect my kernel, N=8192 prints MSL source al.inspect my kernel, level="tile-ir", N=8192 prints tile IR The problem. Metal compute is powerful but painful to program. You write MSL in a C++ dialect, manually manage buffer bindings, compile pipeline state objects, and set up command encoders. There's no equivalent of Triton, Numba, or CuPy for Metal. What Alloy does. Python → tile IR → MSL, with a runtime that handles dispatch, caching, and optimization: Shared-memory execution — Apple Silicon CPU and GPU share physical memory. Alloy binds caller buffers directly where the storage layout supports it, and uses Alloy-owned shared buffers when plan safety or alignment requires it. Tile IR compiler — Python kernel source → AST → tile IR loads, stores, reductions, MMA, barriers → Metal Shading Language. Handles threadgroup sizing, shared memory allocation, simdgroup decomposition, and barrier placement automatically. Automatic dispatch — builtins return lazy buffers that queue GPU work automatically. Reading results triggers a single fused Metal command buffer commit. No manual batch management needed. Operator fusion — adjacent elementwise kernels fuse automatically, eliminating intermediate buffers. Elementwise ops fuse as prologues and epilogues into reductions, GEMM, softmax, and layernorm. Transposes fuse via stride absorption. Tuning — exhaustive search over tile sizes, loop unrolling, double buffering, and matvec strategies. See CONTRIBUTING.md /rayanht/alloy/blob/main/CONTRIBUTING.md for dev setup, test commands, and PR conventions.