{"slug": "show-hn-iphone-ane-holds-llm-tok-s-while-mlx-and-litert-thermal-throttle", "title": "Show HN: iPhone ANE holds LLM tok/s while MLX and LiteRT thermal-throttle", "summary": "A new open-source benchmark, \"apple-silicon-llm-bench,\" reveals that Google's LiteRT-LM runtime outperforms MLX-Swift on the iPhone 17 Pro for Gemma 4 E2B inference, achieving 55.4 tok/s with 4.5x less memory usage. The benchmark, which tests LLM performance across Apple Silicon devices under real constraints, found MLX-Swift wins on Qwen 3.5 2B decode at 61 tok/s, while CoreML/ANE delivers the smallest memory footprint at 241 MB but the slowest throughput. The results highlight that the optimal runtime depends on the specific model, with purpose-built formats like LiteRT-LM and ANE's memory efficiency challenging general-purpose frameworks.", "body_md": "**On-device LLM benchmark for Apple Silicon — iPhone · iPad · Mac.**\n\nA neutral, reproducible benchmark for running local LLMs (and, in time, ASR / TTS) on Apple Silicon. Compares **MLX Swift, llama.cpp, CoreML (swift-transformers), LiteRT-LM, ExecuTorch, ANEMLL** — and Apple's own Foundation Models — under real device constraints, not just `tok/s` on a server.\n\nRepo: `apple-silicon-llm-bench` · CLI/brand: `yardstick`. Started life as `ios-llm-benchmark` — iPhone is still the headline target, now measured alongside iPad and Mac.\n\nReal LLM inference on a phone — on-device, no server. iPhone 17 Pro, 4-bit, short-chat (128 tokens), median of 3 cold runs. 
**The winning runtime is model-dependent — and the upset is on Gemma.**\n\n**Decode throughput** — tok/s, higher is better (🏆 = winner):\n\n| Model (4-bit, n=3) | 🔴 LiteRT-LM | 🟣 MLX-Swift | 🔵 llama.cpp | 🟠 CoreML/ANE |\n|---|---|---|---|---|\n| Gemma 4 E2B | 55.4 🏆 | 47.5 | 37.8 | 33.4 |\n| Qwen 3.5 2B | — | 61.2 🏆 | 39.1 | 27.9 |\n\n**Peak memory** — MB, lower is better (🏆 = winner):\n\n| Model (4-bit, n=3) | 🔴 LiteRT-LM | 🟣 MLX-Swift | 🔵 llama.cpp | 🟠 CoreML/ANE |\n|---|---|---|---|---|\n| Gemma 4 E2B | 641 🏆 | 2,900 | 3,156 | 1,187 |\n| Qwen 3.5 2B | — | 1,279 | 1,479 | 241 🏆 |\n\n- **The upset — Gemma 4 E2B:** Google's **LiteRT-LM** (INT4-QAT, GPU, its native `.litertlm`) beats MLX-Swift on decode **and** uses ~4.5× less memory (641 MB vs 2,900). The purpose-built runtime wins on its own format.\n- **MLX-Swift wins Qwen 3.5 2B decode** — 61 vs 39 tok/s. (LiteRT-LM has no Qwen entry — its catalog is Gemma-only.)\n- **CoreML / ANE is the memory champion** — Qwen 3.5 2B in just **241 MB** (~5× leaner than MLX's 1,279) via chunked-MLKV on the Neural Engine — but it's the **slowest decode** (ANE trades throughput for footprint), same story as on M4 Max.\n- **ANE is near-parity with the desktop:** CoreML Gemma 4 E2B does 33 tok/s on iPhone vs 32.5 on M4 Max — same silicon family. The **GPU** runtimes pay the real on-device tax: ~4–5× slower than M4 Max (Qwen 3.5 2B → 61 tok/s vs 292).\n- **Counting:** MLX / llama.cpp / LiteRT-LM report exact tokenizer tokens (LiteRT-LM via `getBenchmarkInfo`); CoreML/ANE counts streamed pieces (≈ tokens). LiteRT-LM runs to EOS (no per-call cap → ~458-tok reply vs the others' 128 budget); decode tok/s is a rate, so the head-to-head holds.\n- **Fully automated, side-loaded** via `devicectl` headless mode — nothing typed on the phone, same methodology as the desktop rows.\n- **Coming next:** Apple Foundation Models, more models and more iPhones / iPads. [One row is a great PR](/john-rocky/apple-silicon-llm-bench/blob/main/CONTRIBUTING.md).\n\nHow the LiteRT-LM row was measured: `google-ai-edge/LiteRT-LM` 0.12.0 running `litert-community/gemma-4-E2B-it.litertlm` (INT4-QAT) on the Metal GPU backend, via the in-tree `MediaPipeRuntime` adapter — same headless harness + prompt as every other row (3 cold runs, median). Token counts and tok/s come from LiteRT-LM's own benchmark counters (`Conversation.getBenchmarkInfo`), so they're exact, not estimated. It generates to EOS (no per-call output cap in the API), so its token count is the model's full reply rather than the 128-token budget — decode tok/s is a rate and stays comparable; memory is exact process RSS. LiteRT-LM is vendored as a local SwiftPM package (`scripts/bootstrap.sh` clones it with `GIT_LFS_SKIP_SMUDGE=1`; the released package trips SwiftPM's unsafe-flags rule via its `-all_load`).\n\nHow the CoreML/ANE rows were measured: `john-rocky/CoreML-LLM` on the Neural Engine (`computeUnits: .cpuAndNeuralEngine`) — Gemma 4 E2B via the chunked `.mlmodelc` path, Qwen 3.5 2B via `Qwen35MLKVGenerator` (chunked MLKV, hence the 241 MB). Decode counts streamed pieces (≈ tokens); first-load ANE compilation makes its load time high (and it's the lowest-throughput runtime — the ANE trades speed for memory).\n\nDecode tok/s is the headline number; the full per-run audit (prefill, TTFT, inter-token jitter, memory) lives in `RESULTS.md`.\n\nThe table above is **cold-burst** speed. 
Run the same model **continuously** and it flips: the GPU runtimes (MLX, LiteRT-LM) heat up and throttle **50%+ within ~60 s**, while the **ANE barely moves** — it draws ~half the power, so it heats slowly and the SoC doesn't throttle it.\n\n| Gemma 4 E2B, iPhone 17 Pro | Burst tok/s | Sustained (10 min) | Retained |\n|---|---|---|---|\n| CoreML / ANE | 33 | 22 | 67% |\n| MLX / GPU | 48 | 18 | 38% |\n| LiteRT-LM / GPU | 56 | 27 | 48% |\n\nTwo **independent** GPU runtimes collapsing the same way is a GPU-thermal property of the phone, not a runtime quirk. MLX ends up *below* the ANE; LiteRT keeps only a slim lead after shedding half its speed. **The GPU wins the sprint; the ANE wins the marathon** — and it frees the GPU for the rest of the app.\n\nMethod: 600 s continuous generation, cold (`nominal`) start, unplugged, tg128; decode rate from a rolling window. Raw JSONL in `results/raw/iphone17pro-*-energy-tg128.jsonl`; redraw with `scripts/throttle_chart.py` (curves table via `scripts/throttle_curve.py`). LiteRT-LM has no output-token cap (longer per-call) and that run started at `fair` thermal; CoreML-LLM uses sliding-window attention (bounded context), part of why its decode stays flat.\n\nThe same harness on a laptop-class chip, for scale. No runtime wins everything here — each optimises a different corner of the throughput / memory / energy / streaming box:\n\n- **mlx-swift** wins decode throughput on every cell measured (1.4×–2.1× over llama.cpp after early-2026 kernel updates).\n- **Apple Foundation Models** is 2× more energy-efficient per token than the GPU-backed runtimes, 4× more than CoreML/ANE.\n- **CoreML / ANE** wins peak memory (chunked MLKV) but is the slowest *and* the worst on J/token.\n- **llama.cpp** sits in the middle on speed and energy — no axis it wins, no axis it loses badly.\n\nTables for the exact numbers live below. 
\n\nRegenerate after adding rows: `python scripts/generate_charts.py`.\n\nOne device, four runtimes, multiple models. Decode tok/s is the primary headline number; the full table (prefill, TTFT, peak memory, per-run audit trail) lives in `RESULTS.md`. Read the Headline observations section before drawing conclusions — the runtime ranking is model-size-dependent.\n\n| Logical model | Params | n | mlx-swift (Q4) | llama.cpp (Q4_K_M) | coreml-llm | litert-lm (.litertlm) |\n|---|---|---|---|---|---|---|\n| Qwen 2.5 0.5B | 0.5 B | 3 | 531.1 | 297.1 | 181.2 (FP16) | n/a |\n| Qwen 3.5 0.8B | 0.8 B | 3 | 421.1 | 201.1 | 58.2 (INT8) | n/a |\n| Qwen 3.5 2B | 2 B | 3 | 291.9 | 149.7 | 35.0 (INT8) | n/a |\n| Gemma 4 E2B | 2 B | 3 | 185.4 | 119.2 | 32.5 (INT4 palettized) | pending |\n| Gemma 4 E4B | 4 B | 3 | 113.5 | 80.5 | not run | pending |\n\n`litert-lm` column: **pending** = adapter wired against `google-ai-edge/LiteRT-LM` v0.12.0, M4 Max run not yet captured (see `RESULTS.md` / `Yardstick_USER_RUNS.md`). **n/a** = LiteRT-LM's catalog is Gemma-only (`.litertlm`), so the Qwen rows have no entry. For reference, Google's E2B model card reports 56.5 tok/s on iPhone 17 Pro GPU — a vendor figure on a different device, not an M4 Max Yardstick measurement.\n\n→ **MLX-Swift now wins decode on every cell** — 1.4×–2.1× over llama.cpp — after upstream `mlx-swift-lm` shipped Qwen + Gemma kernel updates in early 2026 (the Qwen rows roughly tripled vs. the snapshot captured before those landed). The old \"llama.cpp Metal always wins small-model decode\" rule is no longer true on M4 Max; re-measure before quoting it. CoreML / ANE is the slowest of the three on every cell, in exchange for the dramatic memory savings shown below.\n\nThe decode-tok/s table above hides the memory side. 
Same models, looking at peak working-set (MB) instead:\n\n| Logical model | Params | mlx-swift | llama.cpp | coreml-llm | litert-lm |\n|---|---|---|---|---|---|\n| Qwen 2.5 0.5B | 0.5 B | 390 | 538 | 962 | n/a |\n| Qwen 3.5 0.8B | 0.8 B | 600 | 752 | 221 (INT8) | n/a |\n| Qwen 3.5 2B | 2 B | 1223 | 1443 | 230 (INT8) | n/a |\n| Gemma 4 E2B | 2 B | 2829 | 3212 | 1036 | pending |\n| Gemma 4 E4B | 4 B | 4376 | 5150 | — | pending |\n\n→ **\"CoreML/ANE wins memory\" is true once the chunked MLKV layout kicks in.** At 0.5 B params MLX-Swift is still smaller (390 MB vs CoreML's 962 MB monolithic FP16); from 0.8 B onward, CoreML's chunked MLKV path (`Qwen35MLKVGenerator`: mmap'd embed sidecar + on-demand ANE chunks) holds the process RSS roughly flat — 221 MB at 0.8 B, 230 MB at 2 B — while MLX and llama.cpp scale linearly with parameter count.\n\nThe number nobody else publishes: how many joules does each backend burn per generated token? Captured via [scripts/measure_energy.py](/john-rocky/apple-silicon-llm-bench/blob/main/scripts/measure_energy.py), which co-runs `powermetrics` (whole-system, package power = CPU + GPU + ANE) and clips the sample window to the bench's reported active time.\n\nThe ANE path draws **~half** the GPU path's package power at full decode (12.7 W vs ~24.7 W) — the same power gap that makes the GPU runtimes thermally throttle on iPhone while the ANE holds its rate (see the sustained-throttle section above).\n\n| Runtime | Avg pkg power (W) | Energy / 512-tok run (J) | J / token |\n|---|---|---|---|\n| apple-fm (system model) | 7.6 | 67.4 | 0.11 |\n| mlx-swift (4-bit MLX) | 24.7 | 123.0 | 0.24 |\n| llama.cpp (Q4_K_M, GGUF) | 24.5 | 126.3 | 0.25 |\n| coreml-llm (INT4 palettized, ANE) | 12.7 | 244.9 | 0.48 |\n\n→ **Energy ranking inverts the decode-tok/s ranking.** Apple FM is 2× more efficient per token than the GPU-backed runtimes despite producing tokens at ~half the rate. 
CoreML/ANE has the lowest *instantaneous* power (12.7 W) but is the *worst* J/tok at 4× Apple FM, because the slower decode (32 tok/s) keeps the package powered up much longer. MLX-Swift and llama.cpp draw the most W (GPU) but produce tokens fast enough to break even at ~0.24 J/tok. Whole-system measurement includes the idle baseline so all four numbers slightly inflate per-token energy — useful for ranking, not for absolute attribution. iPhone energy uses the 1%-battery-step API instead (different methodology, similar table shape).\n\n**llama.cpp** (Q4_K_M GGUF, M4 Max, short-chat)\n\n| Model | Params | n | TTFT (ms) | Decode tok/s | Peak Mem (MB) |\n|---|---|---|---|---|---|\n| Qwen 2.5 0.5B | 0.5 B | 3 | 22 | 297.1 | 538 |\n| Qwen 3.5 0.8B | 0.8 B | 3 | 22 | 201.1 | 752 |\n| Llama 3.2 1B | 1.0 B | 3 | 25 | 285.9 | 1022 |\n| Qwen 3.5 2B | 2 B | 3 | 29 | 149.7 | 1443 |\n| Gemma 4 E2B | 2 B | 3 | 41 | 119.2 | 3212 |\n| Gemma 4 E4B | 4 B | 3 | 62 | 80.5 | 5150 |\n\n**mlx-swift** (Q4 / MLX, M4 Max, short-chat)\n\n| Model | Params | n | TTFT (ms) | Decode tok/s | Peak Mem (MB) |\n|---|---|---|---|---|---|\n| Qwen 2.5 0.5B | 0.5 B | 3 | 21 | 531.1 | 390 |\n| Qwen 3.5 0.8B | 0.8 B | 3 | 36 | 421.1 | 600 |\n| Qwen 3.5 2B | 2 B | 3 | 42 | 291.9 | 1223 |\n| Gemma 4 E2B | 2 B | 3 | 68 | 185.4 | 2829 |\n| Gemma 4 E4B | 4 B | 3 | 90 | 113.5 | 4376 |\n\n**coreml-llm** (CoreML / ANE, M4 Max, short-chat)\n\n| Model | Params | n | TTFT (ms) | Decode tok/s | Peak Mem (MB) |\n|---|---|---|---|---|---|\n| LFM 2.5 350M | 0.35 B | 1 | 383 | 58.9 | 98 |\n| Qwen 2.5 0.5B | 0.5 B | 3 | 171 | 181.2 | 962 |\n| Qwen 3.5 0.8B | 0.8 B | 3 | 405 | 58.2 | 221 |\n| Qwen 3.5 2B | 2 B | 3 | 665 | 35.0 | 230 |\n| Gemma 4 E2B | 2 B | 3 | 525 | 32.5 | 1036 |\n\n→ CoreML/ANE trades throughput for memory: 3-8× less peak working set than MLX-Swift / llama.cpp at the same model size, at ~half the decode tok/s. 
The Qwen 3.5 0.8B / 2B numbers come from the dedicated `Qwen35MLKVGenerator` (ANE chunked decode, KV in `MLState` — public API since CoreML-LLM `v1.9.0`), not the generic `CoreMLLLM.load(from:)` path.\n\nApple FM is a single pre-installed model, so it can't share a \"logical model\" row with the open-weight runtimes above. It earns its own line as a reference point — the number to beat when \"just use the system model\" is the alternative.\n\n| Runtime | Model | n | TTFT (ms) | Decode tok/s | Peak Mem (MB, in-process) |\n|---|---|---|---|---|---|\n| apple-fm | Apple Foundation Model (default, ~3 B params est.) | 3 | 269 | 85.2 | 27 |\n\n**Caveats — read before comparing.**\n\n- **Tokens are estimated** (`utf8.count / 4`) because `FoundationModels` does not expose the tokenizer. Treat decode tok/s as ±20%; the other runtimes report counts from their actual tokenizer.\n- **Peak memory is in-process only.** The model lives in Apple's system process, not ours, so 27 MB is the harness overhead — not the true model footprint. Use Activity Monitor / `powermetrics` for the system-wide picture.\n- **Quant is Apple-internal.** Community reverse-engineering puts it at ~2-bit base weights + 4-bit task adapters; Apple has not published numbers. Don't read the decode tok/s as a comment on any specific quant choice.\n\n[Full results — by model, by runtime, full per-run audit trail →](/john-rocky/apple-silicon-llm-bench/blob/main/RESULTS.md)\n\nThis table is the repo. **The easiest possible contribution is one new row.** All three of these are equally valuable:\n\n- **A new device.** Run the existing models on your iPhone / iPad / Mac. Tooling in `Yardstick_USER_RUNS.md`. The \"Devices wanted\" list at the bottom of `RESULTS.md` is the shortlist.\n- **A new model.** Drop the model id into the `ModelCatalog` for the runtime that can load it.\n- **A new runtime.** Wire it up in `ios/BenchmarkApp/Sources/Runtimes/` following the `LLMRuntime` protocol; the harness will pick it up.\n\nWorkflow once you have the build set up:\n\n```\n# 1. Run 3 times to get a stable median:\nfor run in 1 2 3; do\n  yardstick run --task short-chat \\\n                --runtime mlx-swift \\\n                --model <id-or-hf-repo> \\\n                --output results/raw/<device>-<runtime>-<model>-short-chat-run${run}.jsonl\ndone\n\n# 2. Regenerate the tables — they're auto-built from JSONL:\npython scripts/render_results.py\n\n# 3. Commit the JSONLs + the updated RESULTS.md, open a PR.\n```\n\nCI runs `python scripts/render_results.py --check` on every PR — it fails if the JSONLs and the tables disagree, so the human-edited section of RESULTS.md cannot drift out of sync with the raw data.\n\nFull step-by-step (build, model picker, device-specific gotchas) lives in [CONTRIBUTING.md](/john-rocky/apple-silicon-llm-bench/blob/main/CONTRIBUTING.md).\n\nPer `(runtime, model, device, build)` tuple:\n\n- **Speed** — TTFT, prefill `tok/s`, decode `tok/s`, sustained-decode drift over 512+ tokens.\n- **Memory** — baseline, peak during decode, after-generation.\n- **Thermal** — initial / peak / final state across the run.\n- **Jitter** — inter-token latency `p50` / `p95` / `p99` ms, captured from the gap between consecutive `.chunk` events. Surfaces the worst-case stall a chat UI will perceive even when the average decode rate looks smooth.\n- **Energy** — joules per token. 
iOS uses the 1%-battery-step API; Mac uses `scripts/measure_energy.py` (wraps `powermetrics`, see \"Optional: capture Mac energy\" below).\n- **Lifecycle** — survives background → foreground, cancellation latency, streaming.\n- **Quality** *(roadmap)* — WER / CER for ASR, perplexity / MMLU for LLM, byte-identical comparison vs Python references.\n\nMethodology lives under [methodology/](/john-rocky/apple-silicon-llm-bench/blob/main/methodology). The numbers we publish follow [`methodology/fairness-rules.md`](/john-rocky/apple-silicon-llm-bench/blob/main/methodology/fairness-rules.md).\n\n**Optional: capture Mac energy.**\n\n```\nsudo python scripts/measure_energy.py run \\\n     --task short-chat --runtime mlx-swift \\\n     --model mlx-community/gemma-4-e2b-it-4bit \\\n     --output results/raw/<device>-<runtime>-<model>-<task>-energy.jsonl\n```\n\nThe wrapper starts `powermetrics` in the background, runs `yardstick`, stops `powermetrics`, then patches the JSONL with `energyJoules`, `averagePackagePowerW`, and `energyJoulesPerToken`. Numbers are whole-system — run on an idle desktop and use them to compare runtimes on the same Mac, not Macs to each other.\n\nThe iOS app's **History → ••• → Export all (JSONL)** sheet hands you a single newline-delimited file. 
AirDrop it to your Mac, then:\n\n```\npython scripts/import_ios_export.py ~/Downloads/yardstick-*.jsonl\npython scripts/render_results.py\n```\n\nThe import script splits the bundle into one `results/raw/<device>-<runtime>-<model>-<task>-runN.jsonl` per row, re-keying the device label so `render_results.py` recognises it.\n\n```\nYardstick/\n├── Package.swift              SPM: YardstickKit library + `yardstick` Mac CLI\n├── apple/\n│   └── YardstickCLI/          Mac command-line runner\n├── ios/\n│   └── BenchmarkApp/          On-device iOS app (`.xcodeproj`)\n├── runtimes/                  Per-runtime notes (adapters, gotchas, version pins)\n├── devices/                   Per-device pages (chip, RAM, OS, build, signing)\n├── methodology/               How we measure each axis fairly\n├── models/                    Curated model catalog\n├── prompts/                   Standardized prompts per task\n└── results/\n    ├── raw/                   JSONL dumps per run\n    └── (summary tables generated into RESULTS.md)\n```\n\nCurrent status (May 2026): SPM build is clean. Runtime is blocked by `ml-explore/mlx-swift#349` — the MLX Metal kernel bundle isn't emitted by `swift build` from a downstream package, so `swift run yardstick run …` exits with `Failed to load the default metallib`. The same workaround applies to `mlx-swift-examples/llm-tool` (its README says \"Build the llm-tool scheme in Xcode\"). 
A macOS app target that wraps the CLI through Xcode's Metal toolchain is queued as Phase 2.\n\nWhen the Phase-2 macOS target lands, this is the intended shape:\n\n``` bash\n$ yardstick list\n$ yardstick run --task short-chat \\\n                --runtime mlx-swift \\\n                --model mlx-community/Qwen3-0.6B-4bit \\\n                --output results/raw/m4max-mlx-qwen3-0.6b.jsonl\n```\n\nFor now, build verification only:\n\n``` bash\n$ swift build       # Build complete!\n```\n\niOS app, on a connected device:\n\n``` bash\ncd ios/BenchmarkApp\n./scripts/bootstrap.sh           # downloads llama.xcframework + Anemll source\nopen BenchmarkApp.xcodeproj      # set your Team in Signing & Capabilities\n                                 # ⌘R on a connected iPhone\n```\n\nFirst launch downloads the chosen model (default: `mlx-community/gemma-4-e2b-it-4bit`, ~1.3 GB) into the app's Documents directory. Use the picker to swap.\n\n| Runtime | Adapter | Wire-up |\n|---|---|---|\n| MLX Swift | `MLXRuntime.swift` | SPM (`mlx-swift-lm`) |\n| llama.cpp | `LlamaCppRuntime.swift` | vendored `llama.xcframework` (`bootstrap.sh`) |\n| CoreML (swift-transformers) | `CoreMLRuntime.swift` | SPM (`swift-transformers` `Models` + `Generation`) |\n| LiteRT-LM | `MediaPipeRuntime.swift` | SPM (`google-ai-edge/LiteRT-LM` ≥ 0.12.0, product `LiteRTLM`); `#if canImport(LiteRTLM)`-gated |\n| ExecuTorch | `ExecuTorchRuntime.swift` | SPM (`pytorch/executorch` `swiftpm-*` branch) |\n| ANEMLL | `AnemllRuntime.swift` | local SPM via vendored `Anemll/` (`bootstrap.sh`) |\n| Apple Foundation Models | `AppleFMRuntime.swift` | system framework, `#if canImport(FoundationModels)` (macOS 26 / iOS 26) |\n\nAdapters whose framework isn't present at build time are gated with `#if canImport(...)` and fall back to a clear \"not added\" error rather than failing the build.\n\nVerified in-tree:\n\n- `devices/mac-m4-max.md` — Apple M4 Max (macOS 26)\n- `devices/macbook-air-m3.md` — MacBook Air M3, 16 GB (macOS 26)\n- `devices/iphone-17-pro.md` — iPhone 17 Pro (iOS 26)\n\n**Community devices wanted.** If you have an Apple Silicon device not listed above, the fastest way to contribute a row to `RESULTS.md` is to:\n\n- Add a `devices/<your-device>.md` describing the hardware/OS/build.\n- Run the app or CLI per `methodology/measurement.md`.\n- PR the resulting `results/raw/<device>-*.jsonl` and the updated `RESULTS.md` rows.\n\nDevices we'd love numbers for:\n\n- iPhone 15 Pro / 16 Pro / 17 Pro Max / 17 Air\n- iPad Pro M2 / M4\n- MacBook Pro M1 / M2 / M3 / M4 (Pro / Max)\n- Mac Studio Ultra (M2 Ultra / M3 Ultra)\n- Mac mini M2 / M4\n\n| Backend | Build on Mac | Run on Mac | Notes |\n|---|---|---|---|\n| MLX Swift LM | ✅ | ✅ | Native SPM macOS. The Xcode-built tool target sidesteps mlx-swift#349. |\n| llama.cpp | ✅ | ✅ | `macos-arm64_x86_64` slice in `Vendored/llama.xcframework`. CLI uses `LD_RUNPATH_SEARCH_PATHS` to resolve the framework at runtime. |\n| CoreML (CoreMLLLM) | ✅ | ✅ (some models) | macOS 15+. Models with the single-top-level `.mlpackage` layout (e.g. LFM 2.5 350M) auto-download from HF and run; the chunked / multi-`.mlpackage` repos (e.g. `mlboydaisuke/qwen3.5-0.8B-CoreML`) need upstream `CoreMLLLM` work to load. |\n| ExecuTorch | ✅ | ⏸ | Build path is clean; current ET-community models ship SentencePiece `tokenizer.model` but ET's `hf_tokenizer.cpp` expects HF-format `tokenizer.json`. Needs a model with HF tokenizer or an ET-side SentencePiece adapter. |\n| ANEMLL | ✅ | ⏸ | Build path is clean; `swift-huggingface.HFDownloader` fails on `.mlmodelc/` directory-shaped HF repos. Needs upstream downloader work. |\n| LiteRT-LM | ✅ | ⏸ | `google-ai-edge/LiteRT-LM` v0.12.0 ships `ios-arm64` + `macos-arm64` slices, wired via SPM (product `LiteRTLM`, macOS 12+). Build path clean; M4 Max run pending. Watch the package's `-all_load` for duplicate-symbol clashes with the vendored `llama` / `executorch` static libs (fall back to scoped `-force_load`). 
|\n\n- **Phase 1** — repo rename, top-level SPM (`YardstickKit` + `yardstick` CLI), Mac CLI builds clean, README + device pages, methodology docs, iOS app intact.\n- **Phase 2** — Mac CLI runs end-to-end (via Xcode-built target to sidestep mlx-swift #349), first M4 Max numbers committed to `RESULTS.md`.\n- **Phase 2.5** — All 5 buildable backends (MLX, llama.cpp, CoreML, ExecuTorch, ANEMLL) wired into the Mac tool target; first cross-backend row (Gemma 4 E2B: MLX vs llama.cpp).\n- **Phase 3** *(in progress)* — fill remaining adapter row gaps (downloader + model-format work, mostly upstream), MacBook Air M3 + iPhone 17 Pro numbers via `[Yardstick_USER_RUNS.md](../Yardstick_USER_RUNS.md)`.\n- **Phase 4** — quality / accuracy tasks: WER + CER (reusing `swift-transformers` Whisper normalizer), perplexity, MMLU subset. ASR + TTS adapters (WhisperKit, Apple Speech, system TTS).\n- **Phase 5** — public results dashboard, regeneration CI, comparison plots.\n\nMIT, see [LICENSE](/john-rocky/apple-silicon-llm-bench/blob/main/LICENSE).", "url": "https://wpnews.pro/news/show-hn-iphone-ane-holds-llm-tok-s-while-mlx-and-litert-thermal-throttle", "canonical_source": "https://github.com/john-rocky/apple-silicon-llm-bench", "published_at": "2026-06-04 05:17:10+00:00", "updated_at": "2026-06-04 05:47:39.373223+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "artificial-intelligence", "ai-infrastructure", "ai-chips"], "entities": ["Apple", "Google", "MLX", "LiteRT", "llama.cpp", "CoreML", "ANEMLL", "Gemma"], "alternates": {"html": "https://wpnews.pro/news/show-hn-iphone-ane-holds-llm-tok-s-while-mlx-and-litert-thermal-throttle", "markdown": "https://wpnews.pro/news/show-hn-iphone-ane-holds-llm-tok-s-while-mlx-and-litert-thermal-throttle.md", "text": "https://wpnews.pro/news/show-hn-iphone-ane-holds-llm-tok-s-while-mlx-and-litert-thermal-throttle.txt", "jsonld": "https://wpnews.pro/news/show-hn-iphone-ane-holds-llm-tok-s-while-mlx-and-litert-thermal-throttle.jsonld"}}