{"slug": "show-hn-alloy-a-pytorch-backend-and-inference-engine-for-apple-silicon", "title": "Show HN: Alloy – a PyTorch backend and inference engine for Apple Silicon", "summary": "Alloy, a new PyTorch backend and inference engine for Apple Silicon, has been released as a technical preview. The open-source project compiles Python GPU kernels to Metal and supports LLM serving with a drop-in compatible API for OpenAI, Anthropic, and Ollama clients. Alloy includes a torch.compile backend and features like warm-prefix KV reuse, on-GPU sampling, and speculative decoding.", "body_md": "Kernel authoring DSL, `torch.compile` backend, and LLM serving for Apple Silicon.\n\nAlloy is a compiler and runtime for GPU compute kernels on Apple Silicon. You write kernels in Python. Alloy compiles them to Metal through a tile IR pipeline, covering everything from per-thread scalar kernels to cooperative tiled GEMM with simdgroup MMA and automatic operator fusion for multi-kernel pipelines.\n\n**Status**: technical preview. Requires Apple Silicon (M1+) and macOS 13+. The Python packages need Python 3.10–3.12.\n\n[Install](#install) | [Inference server - Quickstart](#inference-server---quickstart) | [torch.compile backend](#torchcompile-backend) | [Benchmarks](#benchmarks) | [Writing kernels](#writing-kernels) | [Why Alloy](#why-alloy) | [Contributing](#contributing) | [License](#license)\n\n## Install\n\n**Python (pip / uv)**\n\n```\npip install 'alloy-kit[serve]'   # local LLM server + CLI + torch.compile backend\npip install alloy-kit            # lean: just the GPU kernel compiler (no torch)\npip install 'alloy-kit[all]'     # + training / vision / audio research extras\n\n# import alloy as al\n```\n\nThe PyPI distribution is **alloy-kit**. 
The brackets are optional dependency groups: the lean base provides `@al.kernel` with the tile IR, MSL emitter, and Metal dispatch machinery, and `[serve]` adds everything needed to run the server and the `alloy` CLI.\n\n**Standalone (no Python required):**\n\n```\ncurl -fsSL https://raw.githubusercontent.com/rayanht/alloy/main/installer/install.sh | sh\n```\n\nInstalls a self-contained `alloy` CLI into `/usr/local`.\n\n**From source (contributors):** see [Contributing](#contributing).\n\n## Inference server - Quickstart\n\nAlloy serves a loopback HTTP API that's drop-in compatible with the OpenAI, Anthropic, and Ollama clients.\n\n**Important:** Run `alloy tune <model>` before serving for optimal performance.\n\n```\n# Start the server in the foreground; loads the model\n# from a local Ollama cache or Hugging Face if present.\n\nalloy serve -m qwen3:0.6b                                   # Ollama tag\nalloy serve -m bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M  # HF model\n# OpenAI:\ncurl http://127.0.0.1:11434/v1/chat/completions \\\n  -H 'content-type: application/json' \\\n  -d '{\"model\":\"qwen3:0.6b\",\"messages\":[{\"role\":\"user\",\"content\":\"hi\"}]}'\n\n# Ollama:\ncurl http://127.0.0.1:11434/api/chat \\\n  -d '{\"model\":\"qwen3:0.6b\",\"messages\":[{\"role\":\"user\",\"content\":\"hi\"}]}'\n# Claude Code\nalloy launch claude\n```\n\nThe default port is `11434`. 
Pass `--port 11435` to `alloy serve` (or set `ALLOY_PORT`) to override.\n\n| Feature | Status |\n|---|---|\n| Warm-prefix KV reuse (bookmarks + branching) | Stable |\n| On-GPU sampling (temp / top-p / top-k / min-p / seed) | Stable |\n| Constrained decoding (xgrammar JSON + tool grammars) | Stable |\n| Tool calling (OpenAI / Anthropic / Ollama, per-family parsers) | Stable |\n| Reasoning / thinking split | Stable |\n| MoE inference | Stable (Qwen3.5-MoE) |\n| Vision input | Stable (gemma4) |\n| Audio input | Stable (gemma4) |\n| Embeddings | Stable (nomic-embed-text) |\n| Speculative decoding: PLD (prompt lookup) | Opt-in (`--spec pld`) |\n| Speculative decoding: MTP | Opt-in (`--spec mtp`, Qwen3.5) |\n| Speculative decoding: DFlash (block diffusion) | Opt-in (`--spec dflash`) |\n| Paged KV cache | Opt-in (`ALLOY_KV=paged`) |\n| KV cache quantization (int8 + fp16 scales) | Opt-in (`--kv-quant q8_0`) |\n\n## Supported quantizations\n\n**Model weights**\n\n| source | format | supported |\n|---|---|---|\n| GGUF | Q4_K (Q4_K_M / Q4_K_S) | ✅ |\n| GGUF | Q5_0 | ✅ |\n| GGUF | Q6_K | ✅ |\n| GGUF | Q8_0 | ✅ |\n| GGUF | F16 / BF16 / F32 | ✅ |\n| GGUF | Q2_K / Q3_K / Q5_K | ❌ |\n| GGUF | Q4_0 / Q4_1 / Q5_1 | ❌ |\n| GGUF | IQ1 / IQ2 / IQ3 / IQ4 (IQ4_XS, IQ4_NL) | ❌ |\n| MLX | 4-bit affine (group size 64 / 128) | ✅ |\n| MLX | 2-bit / 3-bit / 6-bit / 8-bit | ❌ |\n\n**KV cache**\n\n| format | supported |\n|---|---|\n| fp16 (default) | ✅ |\n| q8_0 | ✅ |\n| q4 / other | ❌ |\n\n## torch.compile backend\n\nAlloy includes a `torch.compile` backend that compiles covered PyTorch FX graphs to fused Metal compute kernels.\n\n```python\nimport torch\nimport transformers\nimport alloy_torch  # registers the \"alloy\" backend\n\nmodel = transformers.AutoModelForCausalLM.from_pretrained(\"gpt2\").eval()\ncompiled = torch.compile(model, backend=\"alloy\")\n\ninput_ids = torch.randint(0, model.config.vocab_size, (1, 16))\noutput = compiled(input_ids=input_ids)\n```\n\nThe backend handles: FX graph 
decomposition, operator fusion (RMSNorm, RoPE, GELU, batched QKV, GEMM+LayerNorm, scalar broadcast), GQA-native attention, compiled dispatch plans, and tuning.\n\nRunnable model examples live in [examples/torch/](/rayanht/alloy/blob/main/examples/torch):\n\n- `mlp.py`: multi-layer perceptron (Linear / LayerNorm / GELU)\n- `resnet.py`: GroupNorm ResNet (Conv2d + residual blocks)\n- `transformer.py`: pre-norm encoder block (SDPA + GELU MLP)\n\nA full `torch.compile` training step (forward, backward, and the optimizer update) runs end to end through Alloy and matches PyTorch eager within floating-point tolerance for dense transformer-style models: embeddings, linear layers, normalization, residual blocks, attention, cross-entropy, and the common optimizers (SGD, Adam, AdamW, RMSprop). A small language model trains end to end, and LoRA fine-tuning of a pretrained transformer works in `model.train()` mode. Enable training support with `set_training_mode(True)` before calling `torch.compile`:\n\n```python\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport alloy_torch  # registers the \"alloy\" backend\nfrom alloy_torch.training import set_training_mode\n\nset_training_mode(True)  # before torch.compile\n\nmodel = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 1))\nstep = torch.compile(model, backend=\"alloy\")\nopt = torch.optim.AdamW(model.parameters(), lr=0.05)\n\nx, y = torch.randn(32, 64), torch.randn(32, 1)\nfor _ in range(20):\n    opt.zero_grad()\n    loss = F.mse_loss(step(x), y)\n    loss.backward()\n    opt.step()\n```\n\nFine-tuning a pretrained transformer with [PEFT](https://github.com/huggingface/peft) LoRA is the same shape:\n\n```python\nimport peft\nimport transformers\n\nmodel = peft.get_peft_model(\n    transformers.AutoModelForCausalLM.from_pretrained(\"gpt2\"),\n    peft.LoraConfig(target_modules=[\"c_attn\"], task_type=\"CAUSAL_LM\"),\n)\nstep = torch.compile(model, backend=\"alloy\")\nopt = torch.optim.AdamW([p for p in model.parameters() if 
p.requires_grad], lr=5e-3)\n\nmodel.train()\nfor input_ids in batches:\n    opt.zero_grad()\n    loss = step(input_ids=input_ids, labels=input_ids).loss\n    loss.backward()\n    opt.step()\n```\n\nRunnable training examples live in [examples/torch/](/rayanht/alloy/blob/main/examples/torch):\n\n- `train_mlp.py`: MLP regression (Linear / LayerNorm / GELU, AdamW)\n- `train_transformer.py`: transformer block + cross-entropy (SGD)\n- `train_lm.py`: tiny language model (Embedding + attention + cross-entropy)\n- `finetune_lora.py`: LoRA fine-tuning of gpt2 (PEFT + transformers)\n\nTraining is still a preview. The backward pass does not yet cover convolutions or pooling, so CNN training is not supported. Inference is the primary, fully validated path.\n\n## Benchmarks\n\nReproduce with `alloy bench <HF_OR_OLLAMA_TAG>`.\n\n| model | quant | pp512 | tg128 |\n|---|---|---|---|\n| LFM2.5-1.2B-Instruct-GGUF | Q4_K_M | 4222 | 508 |\n| bartowski/Llama-3.2-3B-Instruct-GGUF | Q4_K_M | 2061 | 198 |\n| Qwen_Qwen3-0.6B-GGUF | Q4_K_M | 8311 | 612 |\n\n| model | quant | pp512 | tg128 |\n|---|---|---|---|\n| qwen2.5:0.5b | Q4_K_M | 12102 | 505 |\n| qwen3:0.6b | Q4_K_M | 10077 | 584 |\n| llama3.2:1b | Q8_0 | 5653 | 324 |\n| qwen3.5:0.8b | Q8_0 | 6141 | 349 |\n| deepseek-r1:1.5b | Q4_K_M | 3295 | 274 |\n| qwen2.5:1.5b | Q4_K_M | 3295 | 270 |\n| qwen3.5:2b | Q8_0 | 3247 | 187 |\n| gemma4:e2b | Q4_K_M | 2121 | 175 |\n| qwen2.5:3b | Q4_K_M | 1617 | 185 |\n| gemma4:e4b | Q4_K_M | 1079 | 115 |\n| qwen3.5:4b | Q4_K_M | 1098 | 122 |\n| qwen3.5:9b | Q4_K_M | 598 | 78.6 |\n| qwen3.6:35b | Q4_K_M | 988 | 121 |\n\n| model | quant | pp512 | tg128 |\n|---|---|---|---|\n| Qwen/Qwen3-0.6B-MLX-4bit | 4-bit g128 | 10063 | 710 |\n| LiquidAI/LFM2.5-1.2B-Instruct-MLX-4bit | 4-bit g64 | 5688 | 589 |\n| mlx-community/Llama-3.2-3B-Instruct-4bit | 4-bit g64 | 2173 | 220 |\n| mlx-community/Qwen3-4B-4bit | 4-bit g64 | 1673 | 174 |\n| mlx-community/Qwen3-8B-4bit | 4-bit g64 | 866 | 102 |\n\n| model | vision ms | alloy TTFT | alloy dec | 
alloy wall |\n|---|---|---|---|---|\n| gemma4:e2b | 229 | 455 | 172 | 1193 |\n| gemma4:e4b | 257 | 665 | 99.0 | 1949 |\n\nPer-regime encoder tok/s from `alloy bench nomic-embed-text --dataset embeddings`.\n\n| regime | batch | seq | tok/s |\n|---|---|---|---|\n| q_short | 1 | 10 | 5094 |\n| q_long | 1 | 256 | 19161 |\n| b8_short | 8 | 10 | 14142 |\n| b8_long | 8 | 128 | 11840 |\n\n## Writing kernels\n\n```python\nimport numpy as np\nimport alloy as al\n\n@al.kernel\ndef blur(src, dst: al.output, W: al.constexpr, H: al.constexpr):\n    x = al.program_id(0)\n    y = al.program_id(1)\n    acc = 0.0\n    count = 0\n    for dy in range(-1, 2):\n        for dx in range(-1, 2):\n            nx = x + dx\n            ny = y + dy\n            if nx >= 0:\n                if nx < W:\n                    if ny >= 0:\n                        if ny < H:\n                            acc = acc + al.load(src + ny * W + nx)\n                            count = count + 1\n    al.store(dst + y * W + x, acc / count)\n\nW, H = 1920, 1080\nimg = np.random.rand(H, W).astype(np.float32)\n\nout = blur[W, H](img.ravel(), W=W, H=H)\nprint(np.asarray(out).reshape(H, W))\n```\n\nNumPy and PyTorch arrays can be bound directly as inputs for covered contiguous host-memory paths. The kernel's `al.output` is allocated automatically and returned as an `AlloyBuffer`; convert with `np.asarray(...)` or `.numpy()`. Some interop paths allocate Alloy-owned shared buffers or require layout copies, so this is not a blanket promise that every input type and view is no-copy.\n\nMore runnable examples live in [examples/kernel/](/rayanht/alloy/blob/main/examples/kernel):\n\n- `vector_add.py`: masked elementwise add\n- `elementwise.py`: fused GELU / sigmoid / SiLU\n- `softmax.py`: row-wise softmax, manual vs. builtin\n- `blur.py`: 2D box blur (shown above)\n- `matmul.py`: naive vs. tiled GEMM with simdgroup MMA\n- `histogram.py`: atomics\n- `nbody.py`: N-body simulation\n- `mandelbrot.py`: divergent per-thread iteration\n- `flash_attention.py`: online-softmax attention\n\n```python\n@al.kernel\ndef matmul(A, B_T, C: al.output,\n           BLOCK_M: al.constexpr = 64, BLOCK_N: al.constexpr = 64, BLOCK_K: al.constexpr = 16):\n    M, K = A.shape\n    N = B_T.shape[0]\n    pm = al.program_id(0)\n    pn = al.program_id(1)\n    rm = pm * BLOCK_M + al.arange(0, BLOCK_M)\n    rn = pn * BLOCK_N + al.arange(0, BLOCK_N)\n    rk = al.arange(0, BLOCK_K)\n    a_ptrs = A + rm[:, None] * K + rk[None, :]\n    b_ptrs = B_T + rn[:, None] * K + rk[None, :]\n    acc = al.zeros((BLOCK_M, BLOCK_N), dtype=al.float32)\n    for k in range(0, K, BLOCK_K):\n        a = al.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] < K))\n        b = al.load(b_ptrs, mask=(rn[:, None] < N) & (rk[None, :] < K))\n        acc += al.tile_dot(a, b, transpose_rhs=True)\n        a_ptrs += BLOCK_K\n        b_ptrs += BLOCK_K\n    al.store(C + rm[:, None] * N + rn[None, :], acc, mask=(rm[:, None] < M) & (rn[None, :] < N))\n```\n\nThis compiles to Metal with simdgroup matrix multiply-accumulate (MMA), cooperative tile loads, threadgroup shared memory staging, and optional double buffering, all generated automatically from the tile IR.\n\nHigh-performance implementations of common operations.\n\n```\nC = al.dot_transpose_rhs(A, B)               # tiled GEMM with autotuning\ns = al.softmax(x)                            # fused row-wise softmax\ny = al.layernorm(x, gamma, beta)             # fused layer normalization\ny, _ = al.rms_norm(x, weight)                # fused RMS normalization (+ per-row 1/rms)\nL = al.cross_entropy(logits, labels)         # fused cross-entropy loss kernel\n```\n\nBuiltins infer output shapes and constexpr values from input arrays. They compose with fusion: for example, 
`al.dot` followed by an elementwise kernel automatically fuses the elementwise op as an epilogue.\n\n```\n# Grid and thread indexing\npid = al.program_id(0)                  # threadgroup position (block index)\ntid = al.thread_id()                    # thread position within threadgroup\noffs = pid * 1024 + al.arange(0, 1024)  # block-level offsets\n\n# Memory\nx = al.load(ptr + offs, mask=mask)       # masked global load\nal.store(ptr + offs, val, mask=mask)     # masked global store\nbuf = al.shared(256)                     # threadgroup shared memory\nloc = al.local(8)                        # per-thread register array\nal.barrier()                             # threadgroup memory barrier\nal.coop_load(buf, src_ptr, size)         # cooperative threadgroup load + barrier\nal.copy4(dst, offset, src_ptr)           # vectorized 4-element load\n\n# Tile operations (2D blocks for GEMM, attention, etc.)\nacc = al.zeros((BLOCK_M, BLOCK_N), dtype=al.float32)\nacc += al.tile_dot(a, b, transpose_rhs=True)  # simdgroup MMA\nreduced = al.simd_reduce(val)                 # cross-lane reduction\n\n# Simdgroup (warp-level)\nal.simd_shuffle_xor(val, offset)         # butterfly shuffle\nal.simd_shuffle(val, lane)               # read from specific lane\nacc = al.simd_matrix()                   # 8x8 matrix accumulator\nal.simd_load(src, offset, stride)        # load into simd matrix\nal.simd_mma(acc, a, b)                   # matrix multiply-accumulate\n\n# Atomics\nal.atomic_add(ptr, idx, val)                # atomic fetch-and-add (int32)\nal.atomic_max(ptr, idx, val)\nal.atomic_cas(ptr, idx, expected, desired)  # compare-and-swap\n\n# Control flow — plain Python\nif cond: ...\nfor i in range(N): ...\nwhile cond: ...\n# These three kernels fuse into one — no intermediate buffers allocated.\n# Each call returns a lazy AlloyBuffer; feed it straight into the next:\nt1 = scale[grid](x, N=N)          # t1 = x * 2.0\nt2 = bias[grid](t1, N=N)          # t2 = t1 + 1.0\nresult = 
activate[grid](t2, N=N)  # result = relu(t2)\n\n# Reading the result triggers one fused GPU submission:\nprint(result[0])\n```\n\nPass PyTorch tensors or MLX arrays directly when their storage layout is supported:\n\n```python\nimport torch\nx = torch.randn(32, 128)                  # CPU tensor, lives in unified memory\nresult = my_kernel[grid](x, M=32, N=128)  # x bound directly; result returned as an AlloyBuffer\n```\n\nAlloy's compiled plans may convert PyTorch input storage to Alloy-owned shared memory on first execution so that later dispatches can resolve Metal buffers by handle. For stable storage, this keeps later dispatches free of per-call input copies.\n\n```\nal.inspect(my_kernel, N=8192)                      # prints MSL source\nal.inspect(my_kernel, level=\"tile-ir\", N=8192)     # prints tile IR\n```\n\n## Why Alloy\n\n**The problem.** Metal compute is powerful but painful to program. You write MSL in a C++ dialect, manually manage buffer bindings, compile pipeline state objects, and set up command encoders. There's no equivalent of Triton, Numba, or CuPy for Metal.\n\n**What Alloy does.** Python → tile IR → MSL, with a runtime that handles dispatch, caching, and optimization:\n\n- **Shared-memory execution**: Apple Silicon CPU and GPU share physical memory. Alloy binds caller buffers directly where the storage layout supports it, and uses Alloy-owned shared buffers when plan safety or alignment requires it.\n- **Tile IR compiler**: Python kernel source → AST → tile IR (loads, stores, reductions, MMA, barriers) → Metal Shading Language. Handles threadgroup sizing, shared memory allocation, simdgroup decomposition, and barrier placement automatically.\n- **Automatic dispatch**: builtins return lazy buffers that queue GPU work automatically. Reading results triggers a single fused Metal command buffer commit. No manual batch management needed.\n- **Operator fusion**: adjacent elementwise kernels fuse automatically, eliminating intermediate buffers. 
Elementwise ops fuse as prologues and epilogues into reductions, GEMM, softmax, and layernorm. Transposes fuse via stride absorption.\n- **Tuning**: exhaustive search over tile sizes, loop unrolling, double buffering, and matvec strategies.\n\n## Contributing\n\nSee [CONTRIBUTING.md](/rayanht/alloy/blob/main/CONTRIBUTING.md) for dev setup, test commands, and PR conventions.", "url": "https://wpnews.pro/news/show-hn-alloy-a-pytorch-backend-and-inference-engine-for-apple-silicon", "canonical_source": "https://github.com/rayanht/alloy", "published_at": "2026-06-20 21:32:51+00:00", "updated_at": "2026-06-20 22:07:53.869660+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "ai-infrastructure", "developer-tools"], "entities": ["Alloy", "Apple Silicon", "PyTorch", "Metal", "OpenAI", "Anthropic", "Ollama", "Hugging Face"], "alternates": {"html": "https://wpnews.pro/news/show-hn-alloy-a-pytorch-backend-and-inference-engine-for-apple-silicon", "markdown": "https://wpnews.pro/news/show-hn-alloy-a-pytorch-backend-and-inference-engine-for-apple-silicon.md", "text": "https://wpnews.pro/news/show-hn-alloy-a-pytorch-backend-and-inference-engine-for-apple-silicon.txt", "jsonld": "https://wpnews.pro/news/show-hn-alloy-a-pytorch-backend-and-inference-engine-for-apple-silicon.jsonld"}}