On-device LLM benchmark for Apple Silicon: iPhone · iPad · Mac.
A neutral, reproducible benchmark for running local LLMs (and, in time, ASR / TTS) on Apple Silicon. Compares MLX Swift, llama.cpp, CoreML (swift-transformers), LiteRT-LM, ExecuTorch, ANEMLL, and Apple's own Foundation Models under real device constraints, not just tok/s on a server.
Repo: `apple-silicon-llm-bench` · CLI/brand: `yardstick`. Started life as `ios-llm-benchmark`; iPhone is still the headline target, now measured alongside iPad and Mac.
Real LLM inference on a phone: on-device, no server. iPhone 17 Pro, 4-bit, short-chat (128 tokens), median of 3 cold runs. The winning runtime is model-dependent, and the upset is on Gemma.
Decode throughput: tok/s, higher is better (🏆 = winner):
| Model (4-bit, n=3) | 🔴 LiteRT-LM | 🟣 MLX-Swift | 🔵 llama.cpp | 🟠 CoreML/ANE |
|---|---|---|---|---|
| Gemma 4 E2B | 55.4 🏆 | 47.5 | 37.8 | 33.4 |
| Qwen 3.5 2B | n/a | 61.2 🏆 | 39.1 | 27.9 |
Peak memory: MB, lower is better (🏆 = winner):
| Model (4-bit, n=3) | 🔴 LiteRT-LM | 🟣 MLX-Swift | 🔵 llama.cpp | 🟠 CoreML/ANE |
|---|---|---|---|---|
| Gemma 4 E2B | 641 🏆 | 2,900 | 3,156 | 1,187 |
| Qwen 3.5 2B | n/a | 1,279 | 1,479 | 241 🏆 |
- **The upset, Gemma 4 E2B:** Google's **LiteRT-LM** (INT4-QAT, GPU, its native `.litertlm`) beats MLX-Swift on decode *and* uses ~4.5× less memory (641 MB vs 2,900). The purpose-built runtime wins on its own format.
- **MLX-Swift wins Qwen 3.5 2B decode:** 61 vs 39 tok/s. (LiteRT-LM has no Qwen entry; its catalog is Gemma-only.)
- **CoreML / ANE is the memory champion:** Qwen 3.5 2B in just **241 MB** (~5× leaner than MLX's 1,279) via chunked-MLKV on the Neural Engine, but it's the **slowest decode** (ANE trades throughput for footprint), same story as on M4 Max.
- **ANE is near-parity with the desktop:** CoreML Gemma 4 E2B does 33 tok/s on iPhone vs 32.5 on M4 Max, same silicon family. The GPU runtimes pay the real on-device tax: ~4-5× slower than M4 Max (Qwen 3.5 2B: 61 tok/s vs 292).
- **Counting:** MLX / llama.cpp / LiteRT-LM report exact tokenizer tokens (LiteRT-LM via `getBenchmarkInfo`); CoreML/ANE counts streamed pieces (≈ tokens). LiteRT-LM runs to EOS (no per-call cap: ~458-tok reply vs the others' 128 budget); decode tok/s is a rate, so the head-to-head holds.
- Fully automated, side-loaded via `devicectl` headless mode: nothing typed on the phone, same methodology as the desktop rows.
- Coming next: Apple Foundation Models, more models and more iPhones / iPads. One row is a great PR.
**How the LiteRT-LM row was measured:** `google-ai-edge/LiteRT-LM` 0.12.0 running `litert-community/gemma-4-E2B-it.litertlm` (INT4-QAT) on the Metal **GPU** backend, via the in-tree `MediaPipeRuntime` adapter, with the same headless harness + prompt as every other row (3 cold runs, median). Token counts and tok/s come from LiteRT-LM's own benchmark counters (`Conversation.getBenchmarkInfo`), so they're exact, not estimated. It generates to EOS (no per-call output cap in the API), so its token count is the model's full reply rather than the 128-token budget; decode tok/s is a rate and stays comparable, and memory is exact process RSS. LiteRT-LM is vendored as a local SwiftPM package (`scripts/bootstrap.sh` clones it with `GIT_LFS_SKIP_SMUDGE=1`; the released package trips SwiftPM's unsafe-flags rule via its `-all_load`).
**How the CoreML/ANE rows were measured:** `john-rocky/CoreML-LLM` on the Neural Engine (`computeUnits: .cpuAndNeuralEngine`): Gemma 4 E2B via the chunked `.mlmodelc` path, Qwen 3.5 2B via `Qwen35MLKVGenerator` (chunked MLKV, hence the 241 MB). Decode counts streamed pieces (≈ tokens); first-load ANE compilation makes its load time high (and it's the lowest-throughput runtime, since the ANE trades speed for memory).

Decode tok/s is the headline number; the full per-run audit (prefill, TTFT, inter-token jitter, memory) lives in `RESULTS.md`.
The table above is cold-burst speed. Run the same model continuously and it flips: the GPU runtimes (MLX, LiteRT-LM) heat up and throttle 50%+ within ~60 s, while the ANE barely moves. It draws ~half the power, so it heats slowly and the SoC doesn't throttle it.
| Gemma 4 E2B, iPhone 17 Pro | Burst tok/s | Sustained (10 min) | Retained |
|---|---|---|---|
| CoreML / ANE | 33 | 22 | 67% |
| MLX / GPU | 48 | 18 | 38% |
| LiteRT-LM / GPU | 56 | 27 | 48% |
Two independent GPU runtimes collapsing the same way is a GPU-thermal property of the phone, not a runtime quirk. MLX ends up below the ANE; LiteRT keeps only a slim lead after shedding half its speed. The GPU wins the sprint; the ANE wins the marathon, and it frees the GPU for the rest of the app.
Method: 600 s continuous generation, cold (`nominal`) start, unplugged, tg128; decode rate from a rolling window. Raw JSONL in `results/raw/iphone17pro-*-energy-tg128.jsonl`; redraw with `scripts/throttle_chart.py` (curves table via `scripts/throttle_curve.py`). LiteRT-LM has no output-token cap (longer per-call) and that run started at `fair` thermal; CoreML-LLM uses sliding-window attention (bounded context), part of why its decode stays flat.
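The rolling-window rate and the "Retained" column can be illustrated with a small sketch. It assumes the harness records one timestamp per decoded token; the input shape, the window length and both function names are hypothetical, not taken from `scripts/throttle_curve.py`.

```python
from collections import deque


def rolling_decode_rate(token_times_s, window_s=10.0):
    """Return (time, tok/s) pairs from a trailing window over token timestamps.

    Hypothetical helper: the window length and input shape are assumptions,
    not the harness's actual settings.
    """
    window = deque()
    rates = []
    for t in token_times_s:
        window.append(t)
        # Keep only tokens emitted inside (t - window_s, t].
        while window[0] <= t - window_s:
            window.popleft()
        # Report a rate only once a full window of history exists.
        if t - token_times_s[0] >= window_s:
            rates.append((t, len(window) / window_s))
    return rates


def retained_fraction(burst_tok_s, sustained_tok_s):
    """Share of the burst decode rate that survives sustained generation."""
    return sustained_tok_s / burst_tok_s
```

With the Gemma 4 E2B figures from the table, `retained_fraction(33, 22)` is about 0.67 for CoreML / ANE and `retained_fraction(48, 18)` is 0.375 for MLX, which matches the Retained column after rounding.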
The same harness on a laptop-class chip, for scale. No runtime wins everything here; each optimises a different corner of the throughput / memory / energy / streaming box:

- **mlx-swift wins decode throughput** on every cell measured (1.4×-1.8× over llama.cpp after early-2026 kernel updates).
- **Apple Foundation Models is 2× more energy-efficient per token** than the GPU-backed runtimes, 4× more than CoreML/ANE.
- **CoreML / ANE wins peak memory** (chunked MLKV) but is the slowest *and* the worst on J/token.
- **llama.cpp sits in the middle** on speed and energy: no axis it wins, no axis it loses badly.

Tables for the exact numbers live below.

Regenerate after adding rows: `python scripts/generate_charts.py`.
One device, four runtimes, multiple models. Decode tok/s is the primary headline number; the full table (prefill, TTFT, peak memory, per-run audit trail) lives in `RESULTS.md`. Read the Headline observations section before drawing conclusions: the runtime ranking is **model-size-dependent**.
| Logical model | Params | n | mlx-swift (Q4) | llama.cpp (Q4_K_M) | coreml-llm | litert-lm (.litertlm) |
|---|---|---|---|---|---|---|
| Qwen 2.5 0.5B | 0.5 B | 3 | 531.1 | 297.1 | 181.2 (FP16) | n/a |
| Qwen 3.5 0.8B | 0.8 B | 3 | 421.1 | 201.1 | 58.2 (INT8) | n/a |
| Qwen 3.5 2B | 2 B | 3 | 291.9 | 149.7 | 35.0 (INT8) | n/a |
| Gemma 4 E2B | 2 B | 3 | 185.4 | 119.2 | 32.5 (INT4 palettized) | pending |
| Gemma 4 E4B | 4 B | 3 | 113.5 | 80.5 | not run | pending |
`litert-lm` column: **pending** = adapter wired against `google-ai-edge/LiteRT-LM` v0.12.0, M4 Max run not yet captured (see `RESULTS.md` / `Yardstick_USER_RUNS.md`). **n/a** = LiteRT-LM's catalog is Gemma-only (`.litertlm`), so the Qwen rows have no entry. For reference, Google's E2B model card reports 56.5 tok/s on iPhone 17 Pro GPU, which is a vendor figure on a different device, not an M4 Max Yardstick measurement.
MLX-Swift now wins decode on every cell (1.4×-1.8× over llama.cpp) after upstream `mlx-swift-lm` shipped Qwen + Gemma kernel updates in early 2026 (the Qwen rows roughly tripled vs. the snapshot captured before those landed). The old "llama.cpp Metal always wins small-model decode" rule is no longer true on M4 Max; re-measure before quoting it. CoreML / ANE is the slowest of the three on every cell, in exchange for the dramatic memory savings shown below.
The decode-tok/s table above hides the memory side. Same models, looking at peak working-set instead:
| Logical model | Params | mlx-swift (MB) | llama.cpp (MB) | coreml-llm (MB) | litert-lm (MB) |
|---|---|---|---|---|---|
| Qwen 2.5 0.5B | 0.5 B | 390 | 538 | 962 | n/a |
| Qwen 3.5 0.8B | 0.8 B | 600 | 752 | 221 (INT8) | n/a |
| Qwen 3.5 2B | 2 B | 1223 | 1443 | 230 (INT8) | n/a |
| Gemma 4 E2B | 2 B | 2829 | 3212 | 1036 | pending |
| Gemma 4 E4B | 4 B | 4376 | 5150 | not run | pending |
"CoreML/ANE wins memory" is true once the chunked MLKV layout kicks in. At 0.5 B params MLX-Swift is still smaller (413 MB vs CoreML's 959 MB monolithic FP16); from 0.8 B onward, CoreML's chunked MLKV path (`Qwen35MLKVGenerator`: mmap'd embed sidecar + on-demand ANE chunks) holds the process RSS roughly flat (206 MB at 0.8 B, 215 MB at 2 B) while MLX and llama.cpp scale linearly with parameter count.
The number nobody else publishes: how many joules does each backend burn per generated token? Captured via `scripts/measure_energy.py`, which co-runs `powermetrics` (whole-system, package power = CPU + GPU + ANE) and clips the sample window to the bench's reported active time.

The ANE path draws ~half the GPU path's package power at full decode (12.7 W vs ~24.7 W). This is the same power gap that makes the GPU runtimes thermally throttle on iPhone while the ANE holds its rate (see the sustained-throttle section above).
| Runtime | Avg pkg power (W) | Energy / 512-tok run (J) | J / token |
|---|---|---|---|
| apple-fm (system model) | 7.6 | 67.4 | 0.11 |
| mlx-swift (4-bit MLX) | 24.7 | 123.0 | 0.24 |
| llama.cpp (Q4_K_M, GGUF) | 24.5 | 126.3 | 0.25 |
| coreml-llm (INT4 palettized, ANE) | 12.7 | 244.9 | 0.48 |
Energy ranking inverts the decode-tok/s ranking. Apple FM is 2× more efficient per token than the GPU-backed runtimes despite producing tokens at ~half the rate. CoreML/ANE has the lowest instantaneous power of the open-weight runtimes (12.7 W) but is the worst J/tok at 4× Apple FM, because the slower decode (32 tok/s) keeps the package powered up much longer. MLX-Swift and llama.cpp draw the most W (GPU) but produce tokens fast enough to land at ~0.24 J/tok. Whole-system measurement includes the idle baseline, so all four per-token figures are slightly inflated: useful for ranking, not for absolute attribution. iPhone energy uses the 1%-battery-step API instead (different methodology, similar table shape).
llama.cpp (Q4_K_M GGUF, M4 Max, short-chat)
| Model | Params | n | TTFT (ms) | Decode tok/s | Peak Mem (MB) |
|---|---|---|---|---|---|
| Qwen 2.5 0.5B | 0.5 B | 3 | 22 | 297.1 | 538 |
| Qwen 3.5 0.8B | 0.8 B | 3 | 22 | 201.1 | 752 |
| Llama 3.2 1B | 1.0 B | 3 | 25 | 285.9 | 1022 |
| Qwen 3.5 2B | 2 B | 3 | 29 | 149.7 | 1443 |
| Gemma 4 E2B | 2 B | 3 | 41 | 119.2 | 3212 |
| Gemma 4 E4B | 4 B | 3 | 62 | 80.5 | 5150 |
mlx-swift (Q4 / MLX, M4 Max, short-chat)
| Model | Params | n | TTFT (ms) | Decode tok/s | Peak Mem (MB) |
|---|---|---|---|---|---|
| Qwen 2.5 0.5B | 0.5 B | 3 | 21 | 531.1 | 390 |
| Qwen 3.5 0.8B | 0.8 B | 3 | 36 | 421.1 | 600 |
| Qwen 3.5 2B | 2 B | 3 | 42 | 291.9 | 1223 |
| Gemma 4 E2B | 2 B | 3 | 68 | 185.4 | 2829 |
| Gemma 4 E4B | 4 B | 3 | 90 | 113.5 | 4376 |
coreml-llm (CoreML / ANE, M4 Max, short-chat)
| Model | Params | n | TTFT (ms) | Decode tok/s | Peak Mem (MB) |
|---|---|---|---|---|---|
| LFM 2.5 350M | 0.35 B | 1 | 383 | 58.9 | 98 |
| Qwen 2.5 0.5B | 0.5 B | 3 | 171 | 181.2 | 962 |
| Qwen 3.5 0.8B | 0.8 B | 3 | 405 | 58.2 | 221 |
| Qwen 3.5 2B | 2 B | 3 | 665 | 35.0 | 230 |
| Gemma 4 E2B | 2 B | 3 | 525 | 32.5 | 1036 |
CoreML/ANE trades throughput for memory: 3-8× less peak working set than MLX-Swift / llama.cpp at the same model size, at ~half the decode tok/s. The Qwen 3.5 0.8B / 2B numbers come from the dedicated `Qwen35MLKVGenerator` (ANE chunked decode, KV in `MLState`; public API since CoreML-LLM `v1.9.0`), not the generic `CoreMLLLM.load(from:)` path.
Apple FM is a single pre-installed model, so it can't share a "logical model" row with the open-weight runtimes above. It earns its own line as a reference point: the number to beat when "just use the system model" is the alternative.
| Runtime | Model | n | TTFT (ms) | Decode tok/s | Peak Mem (MB, in-process) |
|---|---|---|---|---|---|
| apple-fm | Apple Foundation Model (default, ~3 B params est.) | 3 | 269 | 85.2 | 27 |
Caveats: read before comparing.

- **Tokens are estimated** (`utf8.count / 4`) because `FoundationModels` does not expose the tokenizer. Treat decode tok/s as ±20%; the other runtimes report counts from their actual tokenizer.
- **Peak memory is in-process only.** The model lives in Apple's system process, not ours, so 27 MB is the harness overhead, not the true model footprint. Use Activity Monitor / `powermetrics` for the system-wide picture.
- **Quant is Apple-internal.** Community reverse-engineering puts it at ~2-bit base weights + 4-bit task adapters; Apple has not published numbers. Don't read the decode tok/s as a comment on any specific quant choice.
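To make the token-estimate caveat concrete, here is a minimal sketch of the `utf8.count / 4` heuristic and the ±20% band it implies for decode tok/s. It assumes integer division and a measured decode time in seconds; the function names are illustrative and not part of the harness.

```python
def estimate_tokens(text: str) -> int:
    """Approximate the token count as UTF-8 byte length divided by 4."""
    return len(text.encode("utf-8")) // 4


def decode_rate_band(text: str, decode_seconds: float, tolerance: float = 0.20):
    """Return (low, estimate, high) decode tok/s for an estimated token count.

    The default tolerance mirrors the +/-20% guidance above; it is a rule
    of thumb, not a measured error bound.
    """
    estimate = estimate_tokens(text) / decode_seconds
    return estimate * (1 - tolerance), estimate, estimate * (1 + tolerance)
```

Because the estimate counts bytes, non-ASCII replies (which use more bytes per character) shift it relative to a real tokenizer, which is one reason to treat the Apple FM rate as a band and not a point value.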
Full results: by model, by runtime, full per-run audit trail.
This table is the repo. The easiest possible contribution is one new row. All three of these are equally valuable:

- **A new device.** Run the existing models on your iPhone / iPad / Mac. Tooling in `Yardstick_USER_RUNS.md`. The "Devices wanted" list at the bottom of `RESULTS.md` is the shortlist.
- **A new model.** Drop the model id into the `ModelCatalog` for the runtime that can load it.
- **A new runtime.** Wire it up in `ios/BenchmarkApp/Sources/Runtimes/` following the `LLMRuntime` protocol; the harness will pick it up.
Workflow once you have the build set up:

    for run in 1 2 3; do
      yardstick run --task short-chat \
        --runtime mlx-swift \
        --model <id-or-hf-repo> \
        --output results/raw/<device>-<runtime>-<model>-short-chat-run${run}.jsonl
    done
    python scripts/render_results.py
CI runs `python scripts/render_results.py --check` on every PR. It fails if the JSONLs and the tables disagree, so the human-edited section of `RESULTS.md` cannot drift out of sync with the raw data.
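The kind of consistency check described above can be sketched as follows. This is not the logic of `scripts/render_results.py`, which is not shown here; the JSONL field name `decodeTokensPerSecond` and both helpers are assumptions made for illustration.

```python
import json
import statistics


def median_decode_rate(jsonl_lines, field="decodeTokensPerSecond"):
    """Median of one numeric field across run records (one JSON object per line).

    The field name is hypothetical; substitute whatever the raw files use.
    """
    rates = [json.loads(line)[field] for line in jsonl_lines if line.strip()]
    return statistics.median(rates)


def table_matches(raw_median, table_value, places=1):
    """True when the published table value equals the raw median after rounding."""
    return round(raw_median, places) == round(table_value, places)
```

A check built this way would take the three run files for one (device, runtime, model) cell, compute the median, and fail when `table_matches` returns `False` for the value printed in the table.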
Full step-by-step (build, model picker, device-specific gotchas) lives in CONTRIBUTING.md.
Per `(runtime, model, device, build)` tuple:

- **Speed**: TTFT, prefill `tok/s`, decode `tok/s`, sustained-decode drift over 512+ tokens.
- **Memory**: baseline, peak during decode, after-generation.
- **Thermal**: initial / peak / final state across the run.
- **Jitter**: inter-token latency `p50` / `p95` / `p99` ms, captured from the gap between consecutive `.chunk` events. Surfaces the worst-case stall a chat UI will perceive even when the average decode rate looks smooth.
- **Energy**: joules per token. iOS uses the 1%-battery-step API; Mac uses `scripts/measure_energy.py` (wraps `powermetrics`, see "Optional: capture Mac energy" below).
- **Lifecycle**: survives background → foreground, cancellation latency, streaming.
- **Quality** *(roadmap)*: WER / CER for ASR, perplexity / MMLU for LLM, byte-identical comparison vs Python references.
Methodology lives under `methodology/`. The numbers we publish follow `methodology/fairness-rules.md`.

Optional: capture Mac energy

    sudo python scripts/measure_energy.py run \
      --task short-chat --runtime mlx-swift \
      --model mlx-community/gemma-4-e2b-it-4bit \
      --output results/raw/<device>-<runtime>-<model>-<task>-energy.jsonl
The wrapper starts `powermetrics` in the background, runs `yardstick`, stops `powermetrics`, then patches the JSONL with `energyJoules`, `averagePackagePowerW`, and `energyJoulesPerToken`. Numbers are whole-system: run on an idle desktop and use them to compare runtimes on the same Mac, not Macs to each other.
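The arithmetic behind the three patched fields can be sketched as below. It assumes fixed-interval power samples stamped at the end of each interval and a known active window. This is a hypothetical reconstruction of the idea, not the code in `scripts/measure_energy.py`.

```python
def clipped_energy_joules(samples, interval_s, active_start_s, active_end_s):
    """Integrate package power over the part of each sample inside the active window.

    samples: list of (sample_end_time_s, package_watts); each sample is assumed
    to cover the interval_s seconds before its timestamp.
    """
    total = 0.0
    for sample_end_s, watts in samples:
        sample_start_s = sample_end_s - interval_s
        overlap = min(sample_end_s, active_end_s) - max(sample_start_s, active_start_s)
        if overlap > 0:
            total += watts * overlap
    return total


def energy_fields(samples, interval_s, active_start_s, active_end_s, tokens):
    """Build the three fields the wrapper is described as patching into the JSONL."""
    joules = clipped_energy_joules(samples, interval_s, active_start_s, active_end_s)
    return {
        "energyJoules": joules,
        "averagePackagePowerW": joules / (active_end_s - active_start_s),
        "energyJoulesPerToken": joules / tokens,
    }
```

Clipping matters because samples taken during model load or after generation finishes would otherwise dilute or inflate the per-token figure.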
The iOS app's History → ••• → Export all (JSONL) sheet hands you a single newline-delimited file. AirDrop it to your Mac, then:

    python scripts/import_ios_export.py ~/Downloads/yardstick-*.jsonl
    python scripts/render_results.py

The import script splits the bundle into one `results/raw/<device>-<runtime>-<model>-<task>-runN.jsonl` per row, re-keying the device label so `render_results.py` recognises it.
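A rough sketch of the splitting step follows. It assumes each exported record carries `device`, `runtime`, `model` and `task` fields; those names, the slug rule and both helpers are hypothetical, and the real `scripts/import_ios_export.py` also re-keys the device label, which this sketch does not attempt.

```python
import json
from collections import defaultdict


def slug(value: str) -> str:
    """Make a value filename-safe (illustrative rule only)."""
    return value.lower().replace("/", "-").replace(" ", "-")


def split_export(lines):
    """Map each record in a JSONL bundle to its own per-run output path."""
    run_counters = defaultdict(int)
    outputs = {}
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        key = "-".join(slug(record[f]) for f in ("device", "runtime", "model", "task"))
        # Repeated (device, runtime, model, task) tuples become run1, run2, ...
        run_counters[key] += 1
        outputs[f"results/raw/{key}-run{run_counters[key]}.jsonl"] = line.strip()
    return outputs
```

Writing each value of the returned mapping to its path would reproduce the one-file-per-row layout that the renderer reads.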
    Yardstick/
    ├── Package.swift          SPM: YardstickKit library + `yardstick` Mac CLI
    ├── apple/
    │   └── YardstickCLI/      Mac command-line runner
    ├── ios/
    │   └── BenchmarkApp/      On-device iOS app (`.xcodeproj`)
    ├── runtimes/              Per-runtime notes (adapters, gotchas, version pins)
    ├── devices/               Per-device pages (chip, RAM, OS, build, signing)
    ├── methodology/           How we measure each axis fairly
    ├── models/                Curated model catalog
    ├── prompts/               Standardized prompts per task
    └── results/
        ├── raw/               JSONL dumps per run
        └── (summary tables generated into RESULTS.md)
Current status (May 2026): SPM build is clean. Runtime is blocked by `ml-explore/mlx-swift#349`: the MLX Metal kernel bundle isn't emitted by `swift build` from a downstream package, so `swift run yardstick run …` exits with `Failed to load the default metallib`. The same workaround applies to `mlx-swift-examples/llm-tool` (its README says "Build the llm-tool scheme in Xcode"). A macOS app target that wraps the CLI through Xcode's Metal toolchain is queued as Phase 2.
When the Phase-2 macOS target lands, this is the intended shape:

    $ yardstick list
    $ yardstick run --task short-chat \
        --runtime mlx-swift \
        --model mlx-community/Qwen3-0.6B-4bit \
        --output results/raw/m4max-mlx-qwen3-0.6b.jsonl

For now, build verification only:

    $ swift build   # Build complete!
To build the iOS app:

    cd ios/BenchmarkApp
    ./scripts/bootstrap.sh        # downloads llama.xcframework + Anemll source
    open BenchmarkApp.xcodeproj   # set your Team in Signing & Capabilities

First launch downloads the chosen model (default: `mlx-community/gemma-4-e2b-it-4bit`, ~1.3 GB) into the app's Documents directory. Use the picker to swap.
| Runtime | Adapter | Wire-up |
|---|---|---|
| MLX Swift | `MLXRuntime.swift` | SPM (`mlx-swift-lm`) |
| llama.cpp | `LlamaCppRuntime.swift` | vendored `llama.xcframework` (`bootstrap.sh`) |
| CoreML (swift-transformers) | `CoreMLRuntime.swift` | SPM (`swift-transformers` `Models` + `Generation`) |
| LiteRT-LM | `MediaPipeRuntime.swift` | SPM (`google-ai-edge/LiteRT-LM` ≥ 0.12.0, product `LiteRTLM`); `#if canImport(LiteRTLM)`-gated |
| ExecuTorch | `ExecuTorchRuntime.swift` | SPM (`pytorch/executorch` `swiftpm-*` branch) |
| ANEMLL | `AnemllRuntime.swift` | local SPM via vendored `Anemll/` (`bootstrap.sh`) |
| Apple Foundation Models | `AppleFMRuntime.swift` | system framework, `#if canImport(FoundationModels)` (macOS 26 / iOS 26) |
Adapters whose framework isn't present at build time are gated with `#if canImport(...)` and fall back to a clear "not added" error rather than failing the build.
Verified in-tree:
- `devices/mac-m4-max.md`: Apple M4 Max (macOS 26)
- `devices/macbook-air-m3.md`: MacBook Air M3, 16 GB (macOS 26)
- `devices/iphone-17-pro.md`: iPhone 17 Pro (iOS 26)
Community devices wanted. If you have an Apple Silicon device not listed above, the fastest way to contribute a row to `RESULTS.md` is to:
- Add a `devices/<your-device>.md` describing the hardware/OS/build.
- Run the app or CLI per `methodology/measurement.md`.
- PR the resulting `results/raw/<device>-*.jsonl` and the updated `RESULTS.md` rows.
Devices we'd love numbers for:
- iPhone 15 Pro / 16 Pro / 17 Pro Max / 17 Air
- iPad Pro M2 / M4
- MacBook Pro M1 / M2 / M3 / M4 (Pro / Max)
- Mac Studio Ultra (M2 Ultra / M3 Ultra)
- Mac mini M2 / M4
| Backend | Build on Mac | Run on Mac | Notes |
|---|---|---|---|
| MLX Swift LM | ✅ | ✅ | Native SPM macOS. The Xcode-built tool target sidesteps mlx-swift#349. |
| llama.cpp | ✅ | ✅ | `macos-arm64_x86_64` slice in `Vendored/llama.xcframework`. CLI uses `LD_RUNPATH_SEARCH_PATHS` to resolve the framework at runtime. |
| CoreML (CoreMLLLM) | ✅ | ✅ (some models) | macOS 15+. Models with the single-top-level `.mlpackage` layout (e.g. LFM 2.5 350M) auto-download from HF and run; the chunked / multi-`.mlpackage` repos (e.g. `mlboydaisuke/qwen3.5-0.8B-CoreML`) need upstream CoreMLLLM work to load. |
| ExecuTorch | ✅ | ⏸ | Build path is clean; current ET-community models ship SentencePiece `tokenizer.model` but ET's `hf_tokenizer.cpp` expects HF-format `tokenizer.json`. Needs a model with HF tokenizer or an ET-side SentencePiece adapter. |
| ANEMLL | ✅ | ⏸ | Build path is clean; swift-huggingface.HFDown fails on `.mlmodelc/` directory-shaped HF repos. Needs upstream down work. |
| LiteRT-LM | ✅ | ⏸ | `google-ai-edge/LiteRT-LM` v0.12.0 ships `ios-arm64` + `macos-arm64` slices, wired via SPM (product `LiteRTLM`, macOS 12+). Build path clean; M4 Max run pending. Watch the package's `-all_load` for duplicate-symbol clashes with the vendored `llama` / `executorch` static libs (fall back to scoped `-force_load`). |
- **Phase 1**: repo rename, top-level SPM (`YardstickKit` + `yardstick` CLI), Mac CLI builds clean, README + device pages, methodology docs, iOS app intact.
- **Phase 2**: Mac CLI runs end-to-end (via Xcode-built target to sidestep mlx-swift #349), first M4 Max numbers committed to `RESULTS.md`.
- **Phase 2.5**: All 5 buildable backends (MLX, llama.cpp, CoreML, ExecuTorch, ANEMLL) wired into the Mac tool target; first cross-backend row (Gemma 4 E2B: MLX vs llama.cpp).
- **Phase 3** *(in progress)*: fill remaining adapter row gaps (down + model-format work, mostly upstream), MacBook Air M3 + iPhone 17 Pro numbers via [Yardstick_USER_RUNS.md](../Yardstick_USER_RUNS.md).
- **Phase 4**: quality / accuracy tasks: WER + CER (reusing `swift-transformers` Whisper normalizer), perplexity, MMLU subset. ASR + TTS adapters (WhisperKit, Apple Speech, system TTS).
- **Phase 5**: public results dashboard, regeneration CI, comparison plots.
MIT, see LICENSE.