# Show HN: Alloy – a PyTorch backend and inference engine for Apple Silicon

> Source:
> Published: 2026-06-20 21:32:51+00:00

Kernel authoring DSL, `torch.compile` backend and LLM serving for Apple Silicon.

Alloy is a compiler and runtime for GPU compute kernels on Apple Silicon. You write kernels in Python, and Alloy compiles them to Metal through a tile IR pipeline, covering everything from per-thread scalar kernels to cooperative tiled GEMM with simdgroup MMA and automatic operator fusion for multi-kernel pipelines.

**Status**: technical preview. Requires Apple Silicon (M1+) and macOS 13+. The Python packages need Python 3.10–3.12.

[Install](#install) · [Inference server - Quickstart](#inference-server---quickstart) · [torch.compile backend](#torchcompile-backend) · [Benchmarks](#benchmarks) · [Writing kernels](#writing-kernels) · [Why Alloy](#why-alloy) · [Contributing](#contributing) · [License](#license)

## Install

**Python (pip / uv)**

```
pip install 'alloy-kit[serve]'   # local LLM server + CLI + torch.compile backend
pip install alloy-kit            # lean: just the GPU kernel compiler (no torch)
pip install 'alloy-kit[all]'     # + training / vision / audio research extras
# import alloy as al
```

The PyPI distribution is **alloy-kit**. The brackets are optional dependency groups: the lean base provides `@al.kernel` with the tile IR, MSL emitter and Metal dispatch machinery, and `[serve]` adds everything needed to run the server and the `alloy` CLI.

**Standalone (no Python required):**

```
curl -fsSL https://raw.githubusercontent.com/rayanht/alloy/main/installer/install.sh | sh
```

Installs a self-contained `alloy` CLI into `/usr/local`.

**From source (contributors):** see [Contributing](#contributing).

## Inference server - Quickstart

Alloy serves a loopback HTTP API that's drop-in compatible with the OpenAI, Anthropic and Ollama clients.

**Important:** Run `alloy tune ` before serving for optimal performance.

```
# Start the server in the foreground; loads the model
# from a local Ollama cache or Hugging Face if present.
alloy serve -m qwen3:0.6b                                   # Ollama tag
alloy serve -m bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M  # HF model

# OpenAI:
curl http://127.0.0.1:11434/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"qwen3:0.6b","messages":[{"role":"user","content":"hi"}]}'

# Ollama:
curl http://127.0.0.1:11434/api/chat \
  -d '{"model":"qwen3:0.6b","messages":[{"role":"user","content":"hi"}]}'

# Claude Code
alloy launch claude
```

The default port is `11434`. Pass `--port 11435` to `alloy serve` (or set `ALLOY_PORT`) to override.

| Feature | Status |
|---|---|
| Warm-prefix KV reuse (bookmarks + branching) | Stable |
| On-GPU sampling (temp / top-p / top-k / min-p / seed) | Stable |
| Constrained decoding (xgrammar JSON + tool grammars) | Stable |
| Tool calling (OpenAI / Anthropic / Ollama, per-family parsers) | Stable |
| Reasoning / thinking split | Stable |
| MoE inference | Stable (Qwen3.5-MoE) |
| Vision input | Stable (gemma4) |
| Audio input | Stable (gemma4) |
| Embeddings | Stable (nomic-embed-text) |
| Speculative decoding: PLD (prompt lookup) | Opt-in (`--spec pld`) |
| Speculative decoding: MTP | Opt-in (`--spec mtp`, Qwen3.5) |
| Speculative decoding: DFlash (block diffusion) | Opt-in (`--spec dflash`) |
| Paged KV cache | Opt-in (`ALLOY_KV=paged`) |
| KV cache quantization (int8 + fp16 scales) | Opt-in (`--kv-quant q8_0`) |

## Supported quantizations

**Model weights**

| source | format | supported |
|---|---|---|
| GGUF | Q4_K (Q4_K_M / Q4_K_S) | ✅ |
| GGUF | Q5_0 | ✅ |
| GGUF | Q6_K | ✅ |
| GGUF | Q8_0 | ✅ |
| GGUF | F16 / BF16 / F32 | ✅ |
| GGUF | Q2_K / Q3_K / Q5_K | ❌ |
| GGUF | Q4_0 / Q4_1 / Q5_1 | ❌ |
| GGUF | IQ1 / IQ2 / IQ3 / IQ4 (IQ4_XS, IQ4_NL) | ❌ |
| MLX | 4-bit affine (group size 64 / 128) | ✅ |
| MLX | 2-bit / 3-bit / 6-bit / 8-bit | ❌ |

**KV cache**

| format | supported |
|---|---|
| fp16 (default) | ✅ |
| q8_0 | ✅ |
| q4 / other | ❌ |

## torch.compile backend

Alloy includes a `torch.compile` backend that compiles covered PyTorch FX graphs to
fused Metal compute kernels.

```python
import torch
import transformers
import alloy_torch  # registers the "alloy" backend

model = transformers.AutoModelForCausalLM.from_pretrained("gpt2").eval()
compiled = torch.compile(model, backend="alloy")

input_ids = torch.randint(0, model.config.vocab_size, (1, 16))
output = compiled(input_ids=input_ids)
```

The backend handles FX graph decomposition, operator fusion (RMSNorm, RoPE, GELU, batched QKV, GEMM+LayerNorm, scalar broadcast), GQA-native attention, compiled dispatch plans, and tuning.

Runnable model examples live in [examples/torch/](/rayanht/alloy/blob/main/examples/torch):

- `mlp.py`: multi-layer perceptron (Linear / LayerNorm / GELU)
- `resnet.py`: GroupNorm ResNet (Conv2d + residual blocks)
- `transformer.py`: pre-norm encoder block (SDPA + GELU MLP)

A full `torch.compile` training step (forward, backward, and the optimizer update) runs end to end through Alloy and matches PyTorch eager within floating-point tolerance for dense transformer-style models: embeddings, linear layers, normalization, residual blocks, attention, cross-entropy, and the common optimizers (SGD, Adam, AdamW, RMSprop). A small language model trains end to end, and LoRA fine-tuning of a pretrained transformer works in `model.train()`.
Enable training mode with `set_training_mode(True)` before calling `torch.compile`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import alloy_torch  # registers the "alloy" backend
from alloy_torch.training import set_training_mode

set_training_mode(True)  # before torch.compile

model = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 1))
step = torch.compile(model, backend="alloy")
opt = torch.optim.AdamW(model.parameters(), lr=0.05)

x, y = torch.randn(32, 64), torch.randn(32, 1)
for _ in range(20):
    opt.zero_grad()
    loss = F.mse_loss(step(x), y)
    loss.backward()
    opt.step()
```

Fine-tuning a pretrained transformer with [PEFT](https://github.com/huggingface/peft) LoRA has the same shape:

```python
import peft
import transformers

# Assumes the snippet above has already run
# (torch and alloy_torch imported, training mode enabled).
model = peft.get_peft_model(
    transformers.AutoModelForCausalLM.from_pretrained("gpt2"),
    peft.LoraConfig(target_modules=["c_attn"], task_type="CAUSAL_LM"),
)
step = torch.compile(model, backend="alloy")
opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=5e-3)

model.train()
for input_ids in batches:  # `batches`: your own iterable of token-id tensors
    opt.zero_grad()
    loss = step(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    opt.step()
```

Runnable training examples live in [examples/torch/](/rayanht/alloy/blob/main/examples/torch):

- `train_mlp.py`: MLP regression (Linear / LayerNorm / GELU, AdamW)
- `train_transformer.py`: transformer block + cross-entropy (SGD)
- `train_lm.py`: tiny language model (Embedding + attention + cross-entropy)
- `finetune_lora.py`: LoRA fine-tuning of gpt2 (PEFT + transformers)

Training is still a preview. The backward pass does not yet cover convolutions or pooling, so CNN training is not supported. Inference is the primary, fully validated path.
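The claim that a compiled training step "matches PyTorch eager within floating-point tolerance" can be checked with an elementwise tolerance test. Below is a minimal, dependency-free sketch of that comparison rule (the same `atol + rtol * |expected|` bound that `torch.allclose` uses). The helper name `allclose`, the tolerance values and the sample losses are illustrative assumptions, not values taken from Alloy's test suite.

```python
def allclose(actual, expected, rtol=1e-4, atol=1e-5):
    """Return True if |a - e| <= atol + rtol * |e| holds for every pair."""
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


# Made-up per-step losses for illustration only (not measurements).
eager_losses = [1.0423, 0.8871, 0.7619]
compiled_losses = [1.04231, 0.88709, 0.76191]

print(allclose(compiled_losses, eager_losses))    # True: differences are ~1e-5
print(allclose([1.05, 0.8871, 0.7619], eager_losses))  # False: first loss is off by ~8e-3
```

In practice you would flatten the eager and compiled outputs (or gradients) into sequences and pick tolerances that suit the dtype; fp16 paths usually need looser bounds than fp32.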
## Benchmarks

Reproduce with `alloy bench `. The column names appear to follow the llama.cpp `llama-bench` convention: `pp512` is prompt processing over a 512-token prompt and `tg128` is generation of 128 tokens, both in tokens per second.

| model | quant | pp512 | tg128 |
|---|---|---|---|
| LFM2.5-1.2B-Instruct-GGUF | Q4_K_M | 4222 | 508 |
| bartowski/Llama-3.2-3B-Instruct-GGUF | Q4_K_M | 2061 | 198 |
| Qwen_Qwen3-0.6B-GGUF | Q4_K_M | 8311 | 612 |

| model | quant | pp512 | tg128 |
|---|---|---|---|
| qwen2.5:0.5b | Q4_K_M | 12102 | 505 |
| qwen3:0.6b | Q4_K_M | 10077 | 584 |
| llama3.2:1b | Q8_0 | 5653 | 324 |
| qwen3.5:0.8b | Q8_0 | 6141 | 349 |
| deepseek-r1:1.5b | Q4_K_M | 3295 | 274 |
| qwen2.5:1.5b | Q4_K_M | 3295 | 270 |
| qwen3.5:2b | Q8_0 | 3247 | 187 |
| gemma4:e2b | Q4_K_M | 2121 | 175 |
| qwen2.5:3b | Q4_K_M | 1617 | 185 |
| gemma4:e4b | Q4_K_M | 1079 | 115 |
| qwen3.5:4b | Q4_K_M | 1098 | 122 |
| qwen3.5:9b | Q4_K_M | 598 | 78.6 |
| qwen3.6:35b | Q4_K_M | 988 | 121 |

| model | quant | pp512 | tg128 |
|---|---|---|---|
| Qwen/Qwen3-0.6B-MLX-4bit | 4-bit g128 | 10063 | 710 |
| LiquidAI/LFM2.5-1.2B-Instruct-MLX-4bit | 4-bit g64 | 5688 | 589 |
| mlx-community/Llama-3.2-3B-Instruct-4bit | 4-bit g64 | 2173 | 220 |
| mlx-community/Qwen3-4B-4bit | 4-bit g64 | 1673 | 174 |
| mlx-community/Qwen3-8B-4bit | 4-bit g64 | 866 | 102 |

| model | vision ms | alloy TTFT | alloy dec | alloy wall |
|---|---|---|---|---|
| gemma4:e2b | 229 | 455 | 172 | 1193 |
| gemma4:e4b | 257 | 665 | 99.0 | 1949 |

Per-regime encoder tok/s from `alloy bench nomic-embed-text --dataset embeddings`:
| regime | batch | seq | tok/s |
|---|---|---|---|
| q_short | 1 | 10 | 5094 |
| q_long | 1 | 256 | 19161 |
| b8_short | 8 | 10 | 14142 |
| b8_long | 8 | 128 | 11840 |

## Writing kernels

```python
import numpy as np
import alloy as al

@al.kernel
def blur(src, dst: al.output, W: al.constexpr, H: al.constexpr):
    x = al.program_id(0)
    y = al.program_id(1)
    acc = 0.0
    count = 0
    # Average the 3x3 neighbourhood, skipping taps that fall outside the image.
    for dy in range(-1, 2):
        for dx in range(-1, 2):
            nx = x + dx
            ny = y + dy
            if nx >= 0:
                if nx < W:
                    if ny >= 0:
                        if ny < H:
                            acc = acc + al.load(src + ny * W + nx)
                            count = count + 1
    al.store(dst + y * W + x, acc / count)

W, H = 1920, 1080
img = np.random.rand(H, W).astype(np.float32)
out = blur[W, H](img.ravel(), W=W, H=H)
print(np.asarray(out).reshape(H, W))
```

NumPy and PyTorch arrays can be bound directly as inputs for covered contiguous host-memory paths. The kernel's `al.output` is allocated automatically and returned as an `AlloyBuffer`; convert it with `np.asarray(...)` or `.numpy()`. Some interop paths allocate Alloy-owned shared buffers or require layout copies, so this is not a blanket promise that every input type and view is no-copy.

More runnable examples live in [examples/kernel/](/rayanht/alloy/blob/main/examples/kernel):

- `vector_add.py`: masked elementwise add
- `elementwise.py`: fused GELU / sigmoid / SiLU
- `softmax.py`: row-wise softmax, manual vs. builtin
- `blur.py`: 2D box blur (shown above)
- `matmul.py`: naive vs. tiled GEMM with simdgroup MMA
- `histogram.py`: atomics
- `nbody.py`: N-body simulation
- `mandelbrot.py`: divergent per-thread iteration
- `flash_attention.py`: online-softmax attention

```python
@al.kernel
def matmul(A, B_T, C: al.output,
           BLOCK_M: al.constexpr = 64,
           BLOCK_N: al.constexpr = 64,
           BLOCK_K: al.constexpr = 16):
    M, K = A.shape
    N = B_T.shape[0]
    pm = al.program_id(0)
    pn = al.program_id(1)

    # Row / column indices covered by this program's output tile.
    rm = pm * BLOCK_M + al.arange(0, BLOCK_M)
    rn = pn * BLOCK_N + al.arange(0, BLOCK_N)
    rk = al.arange(0, BLOCK_K)

    a_ptrs = A + rm[:, None] * K + rk[None, :]
    b_ptrs = B_T + rn[:, None] * K + rk[None, :]

    acc = al.zeros((BLOCK_M, BLOCK_N), dtype=al.float32)
    for k in range(0, K, BLOCK_K):
        # k + rk is the absolute K index, so the mask also guards a partial last K tile.
        a = al.load(a_ptrs, mask=(rm[:, None] < M) & (k + rk[None, :] < K))
        b = al.load(b_ptrs, mask=(rn[:, None] < N) & (k + rk[None, :] < K))
        acc += al.tile_dot(a, b, transpose_rhs=True)
        a_ptrs += BLOCK_K
        b_ptrs += BLOCK_K

    al.store(C + rm[:, None] * N + rn[None, :], acc,
             mask=(rm[:, None] < M) & (rn[None, :] < N))
```

This compiles to Metal with simdgroup matrix multiply-accumulate (MMA), cooperative tile loads, threadgroup shared memory staging, and optional double buffering, all generated automatically from the tile IR.

Alloy also ships high-performance builtin implementations of common operations.

```
C = al.dot_transpose_rhs(A, B)        # tiled GEMM with autotuning
s = al.softmax(x)                     # fused row-wise softmax
y = al.layernorm(x, gamma, beta)      # fused layer normalization
y, _ = al.rms_norm(x, weight)         # fused RMS normalization (+ per-row 1/rms)
L = al.cross_entropy(logits, labels)  # fused cross-entropy loss kernel
```

Builtins infer output shapes and constexpr values from input arrays. They also compose with fusion: for example, `al.dot` followed by an elementwise kernel automatically fuses the elementwise op as an epilogue.
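To make the tiling in the `matmul` kernel above concrete, here is a plain-Python reference that walks the same tile decomposition on nested lists: one outer iteration per output tile, an inner loop over K tiles, and bounds checks standing in for the load and store masks. This is only a sketch of the arithmetic, not Alloy code; the function names `tiled_matmul_rhs_t` and `naive_matmul_rhs_t` and the small block sizes are assumptions chosen for illustration.

```python
def tiled_matmul_rhs_t(A, B_T, block_m=2, block_n=2, block_k=2):
    """Compute C = A @ B_T^T tile by tile. A is M x K, B_T is N x K."""
    M, K = len(A), len(A[0])
    N = len(B_T)
    C = [[0.0] * N for _ in range(M)]
    # Each (m0, n0) pair plays the role of one program in the kernel grid.
    for m0 in range(0, M, block_m):
        for n0 in range(0, N, block_n):
            acc = [[0.0] * block_n for _ in range(block_m)]
            for k0 in range(0, K, block_k):
                for i in range(block_m):
                    for j in range(block_n):
                        for kk in range(block_k):
                            m, n, k = m0 + i, n0 + j, k0 + kk
                            # Out-of-range elements contribute nothing, like a masked load.
                            if m < M and n < N and k < K:
                                acc[i][j] += A[m][k] * B_T[n][k]
            # Masked store: only write elements that fall inside C.
            for i in range(block_m):
                for j in range(block_n):
                    if m0 + i < M and n0 + j < N:
                        C[m0 + i][n0 + j] = acc[i][j]
    return C


def naive_matmul_rhs_t(A, B_T):
    """Untiled reference: C[m][n] = dot(A[m], B_T[n])."""
    return [[sum(a * b for a, b in zip(row, col)) for col in B_T] for row in A]


# M=1, N=2, K=3: none of the dimensions is a multiple of the 2x2x2 tiles.
print(tiled_matmul_rhs_t([[1.0, 2.0, 3.0]], [[1.0, 1.0, 1.0], [0.0, 1.0, 2.0]]))
# [[6.0, 8.0]]
```

Comparing the tiled result against `naive_matmul_rhs_t` on shapes that are not multiples of the block sizes is a cheap way to catch masking mistakes before moving to a GPU kernel.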
```
# Grid and thread indexing
pid = al.program_id(0)                    # threadgroup position (block index)
tid = al.thread_id()                      # thread position within threadgroup
offs = pid * 1024 + al.arange(0, 1024)    # block-level offsets

# Memory
x = al.load(ptr + offs, mask=mask)        # masked global load
al.store(ptr + offs, val, mask=mask)      # masked global store
buf = al.shared(256)                      # threadgroup shared memory
loc = al.local(8)                         # per-thread register array
al.barrier()                              # threadgroup memory barrier
al.coop_load(buf, src_ptr, size)          # cooperative threadgroup load + barrier
al.copy4(dst, offset, src_ptr)            # vectorized 4-element load

# Tile operations (2D blocks for GEMM, attention, etc.)
acc = al.zeros((BLOCK_M, BLOCK_N), dtype=al.float32)
acc += al.tile_dot(a, b, transpose_rhs=True)  # simdgroup MMA
reduced = al.simd_reduce(val)             # cross-lane reduction

# Simdgroup (warp-level)
al.simd_shuffle_xor(val, offset)          # butterfly shuffle
al.simd_shuffle(val, lane)                # read from specific lane
acc = al.simd_matrix()                    # 8x8 matrix accumulator
al.simd_load(src, offset, stride)         # load into simd matrix
al.simd_mma(acc, a, b)                    # matrix multiply-accumulate

# Atomics
al.atomic_add(ptr, idx, val)              # atomic fetch-and-add (int32)
al.atomic_max(ptr, idx, val)
al.atomic_cas(ptr, idx, expected, desired)  # compare-and-swap

# Control flow: plain Python
if cond: ...
for i in range(N): ...
while cond: ...

# These three kernels fuse into one; no intermediate buffers are allocated.
# Each call returns a lazy AlloyBuffer; feed it straight into the next:
t1 = scale[grid](x, N=N)          # t1 = x * 2.0
t2 = bias[grid](t1, N=N)          # t2 = t1 + 1.0
result = activate[grid](t2, N=N)  # result = relu(t2)

# Reading the result triggers one fused GPU submission:
print(result[0])
```

Pass PyTorch tensors or MLX arrays directly when their storage layout is supported:

```python
import torch

x = torch.randn(32, 128)                  # CPU tensor, lives in unified memory
result = my_kernel[grid](x, M=32, N=128)  # x bound directly; result returned as an AlloyBuffer
```

Alloy's compiled plans may convert PyTorch input storage to Alloy-owned shared memory on first execution so subsequent dispatches can resolve Metal buffers by handle. That keeps subsequent dispatches free of per-call input copies for stable storage.

```
al.inspect(my_kernel, N=8192)                   # prints MSL source
al.inspect(my_kernel, level="tile-ir", N=8192)  # prints tile IR
```

## Why Alloy

**The problem.** Metal compute is powerful but painful to program. You write MSL in a C++ dialect, manually manage buffer bindings, compile pipeline state objects, and set up command encoders. There's no equivalent of Triton, Numba, or CuPy for Metal.

**What Alloy does.** Python → tile IR → MSL, with a runtime that handles dispatch, caching, and optimization:

- **Shared-memory execution**: Apple Silicon CPU and GPU share physical memory. Alloy binds caller buffers directly where the storage layout supports it, and uses Alloy-owned shared buffers when plan safety or alignment requires it.
- **Tile IR compiler**: Python kernel source → AST → tile IR (loads, stores, reductions, MMA, barriers) → Metal Shading Language. Handles threadgroup sizing, shared memory allocation, simdgroup decomposition, and barrier placement automatically.
- **Automatic dispatch**: builtins return lazy buffers that queue GPU work automatically. Reading results triggers a single fused Metal command buffer commit. No manual batch management needed.
- **Operator fusion**: adjacent elementwise kernels fuse automatically, eliminating intermediate buffers. Elementwise ops fuse as prologues and epilogues into reductions, GEMM, softmax, and layernorm. Transposes fuse via stride absorption.
- **Tuning**: exhaustive search over tile sizes, loop unrolling, double buffering, and matvec strategies.

## Contributing

See [CONTRIBUTING.md](/rayanht/alloy/blob/main/CONTRIBUTING.md) for dev setup, test commands, and PR conventions.