{"slug": "show-hn-mlx-chronos-benchmark-mlx-inference-engines-on-apple-silicon", "title": "Show HN: mlx-chronos - benchmark MLX inference engines on Apple Silicon", "summary": "A new open-source tool called mlx-chronos provides a standardized benchmark suite and community leaderboard for comparing local LLM inference engines on Apple Silicon Macs. The tool runs reproducible benchmarks, generates sealed JSON results, and allows users to compare performance across different engines and Mac models.", "body_md": "Benchmark suite and community leaderboard for local LLM inference on Apple Silicon. Run a reproducible benchmark, save a sealed JSON result, and compare engines across Macs.\n\n[Overview](#overview) | [Supported Engines](#supported-engines) | [Quick Start](#quick-start) | [CLI Reference](#cli-reference) | [Configuration](#configuration) | [Benchmark Protocol](#benchmark-protocol) | [Leaderboard Rules](#leaderboard-rules) | [Submit Results](#submit-results) | [Roadmap](#roadmap)\n\n`mlx-chronos` is a standardized benchmark tool for local LLM inference engines\non Apple Silicon. 
It detects your Mac, runs a fixed benchmark protocol against\nan OpenAI-compatible engine endpoint, and writes structured result files for\nlocal analysis or public leaderboard submission.\n\nThe public leaderboard is available at\n[igurss.github.io/mlx-chronos](https://igurss.github.io/mlx-chronos).\n\n| Metric | Meaning | Public comparison use |\n|---|---|---|\n| TTFT cold | Time from request start to first non-empty streamed token with cache-avoiding prompts | Yes |\n| TTFT cached | Time to first token after a cache-priming call with the same prompt | Yes |\n| Request throughput | Completion tokens divided by full client-observed request time | Yes, when engine token usage is reliable |\n| Sustained throughput | Optional long throughput run for heat buildup and late-run degradation | Yes, under the sustained profile |\n| System RAM peak | Peak total Mac RAM in use during the benchmark | Yes |\n| Engine RSS | Post-warmup RSS of the engine server process when identifiable | Diagnostic only |\n| Thermal state | Start, end, worst state, samples, and affected benchmark phases when available | Context metadata |\n| Tool calling | Planned future success-rate benchmark | Not yet available |\n\n`0.3.1` simplifies public model identity metadata to model name,\nquantization, model format, and the required model reference URL, while keeping\nthe guided workflows, timing metadata, and stricter leaderboard integrity\nchecks introduced in `0.3.0`.\n\n| Engine | Project | Notes |\n|---|---|---|\n| oMLX | [jundot/omlx](https://github.com/jundot/omlx) | |\n| Rapid-MLX | [raullenchai/Rapid-MLX](https://github.com/raullenchai/Rapid-MLX) | |\n| vllm-mlx | [waybarrios/vllm-mlx](https://github.com/waybarrios/vllm-mlx) | |\n| mlx-lm | [ml-explore/mlx-lm](https://github.com/ml-explore/mlx-lm) | |\n| Ollama | | |\n\nNote: The engine server must already be running before `mlx-chronos run`, `mlx-chronos models`, or `mlx-chronos validate` can query it. 
See [CONTRIBUTING.md](https://github.com/igurss/mlx-chronos/blob/main/CONTRIBUTING.md) for engine setup details.\n\n```\npip install mlx-chronos\n```\n\nOptional thermal-state support through macOS Foundation/PyObjC:\n\n```\npip install \"mlx-chronos[thermal]\"\nmlx-chronos --version\nmlx-chronos upgrade\n```\n\nWhen run in an interactive terminal, `mlx-chronos` performs a best-effort\nbackground PyPI version check. If a newer release is available, it prints a\nshort notice recommending:\n\n```\nmlx-chronos upgrade\n```\n\nSet `MLX_CHRONOS_DISABLE_UPDATE_CHECK=1` to disable the automatic check.\n\n```\nmlx-chronos engines\nmlx-chronos models --engine omlx\nmlx-chronos validate --engine omlx --model \"Qwen3.5-4B-OptiQ-4bit\"\nmlx-chronos wizard\n```\n\nThe wizard provides a terminal menu for common actions and a guided benchmark\nbuilder with engine, model, profile, token bounds, output format, cooldown,\npreflight, notes, and other run options. When the selected engine server is\nrunning, the wizard loads `/models` and lets you select a model from the exposed\nIDs, with manual entry as a fallback. Before launching a benchmark, it shows the\nequivalent `mlx-chronos run ...` command so the same configuration can be reused\nin scripts. 
You can return to the main menu from benchmark setup without\nstarting a run.\n\n```\nmlx-chronos run --engine omlx --model \"Qwen3.5-4B-OptiQ-4bit\"\n```\n\nResults are written to `results/local/` by default.\n\n```\n# Write both JSON and Markdown outputs\nmlx-chronos run --engine omlx --model \"Qwen3.5-4B-OptiQ-4bit\" --format all\n\n# Choose a custom output directory\nmlx-chronos run --engine omlx --model \"Qwen3.5-4B-OptiQ-4bit\" --output-dir ~/Desktop/benchmarks\n\n# Request throughput output token bounds for local experiments\nmlx-chronos run --engine omlx --model \"Qwen3.5-4B-OptiQ-4bit\" --max-tokens 100 --min-tokens 80\n\n# Run the longer heat/throttling-sensitive sustained profile\nmlx-chronos run --engine omlx --model \"Qwen3.5-4B-OptiQ-4bit\" --profile sustained\n\n# Enforce cooldown after a recent run in the same output directory\nmlx-chronos run --engine omlx --model \"Qwen3.5-4B-OptiQ-4bit\" --cooldown-seconds 300\n\n# Fail fast with an extra model access probe before measured work starts\nmlx-chronos run --engine omlx --model \"Qwen3.5-4B-OptiQ-4bit\" --preflight\n\n# Include a model reference URL, required for public leaderboard submissions\nmlx-chronos run --engine omlx \\\n  --model \"Qwen3.5-4B-OptiQ-4bit\" \\\n  --model-url \"https://huggingface.co/mlx-community/Qwen3.5-4B-OptiQ-4bit\"\n```\n\n| Command | Purpose |\n|---|---|\n| `mlx-chronos --version` | Print the installed package version |\n| `mlx-chronos wizard` | Open an interactive menu for common commands and guided benchmark setup |\n| `mlx-chronos upgrade` | Check PyPI and upgrade the current Python environment if a newer release exists |\n| `mlx-chronos engines` | List supported engines and local installed/running status |\n| `mlx-chronos models --engine <name>` | List model IDs exposed by a running engine server |\n| `mlx-chronos validate --engine <name> --model <model>` | Validate hardware, engine, server, and optional model access |\n| `mlx-chronos run --engine <name> --model <model>` | Run a benchmark and save local result files |\n| `mlx-chronos submit --file <result.json> --dry-run` | Validate whether a result is publishable |\n| `mlx-chronos submit --file <result.json>` | Send a validated result to the maintainer inbox |\n\n| Setting | Example | What it changes |\n|---|---|---|\n| `MLX_CHRONOS_<ENGINE>_PORT` | `MLX_CHRONOS_OMLX_PORT=8002` | Overrides an engine server port |\n| `MLX_CHRONOS_CACHED_TTFT_RATIO` | `MLX_CHRONOS_CACHED_TTFT_RATIO=0.8` | Sets the cached-TTFT warning threshold |\n| `MLX_CHRONOS_DISABLE_UPDATE_CHECK` | `MLX_CHRONOS_DISABLE_UPDATE_CHECK=1` | Disables automatic background update checks |\n| `MLX_CHRONOS_SUBMIT_ENDPOINT` | `https://example.test/form` | Overrides the maintainer inbox endpoint |\n\nDefault engine ports:\n\n| Engine | Default port |\n|---|---|\n| oMLX | `8000` |\n| Rapid-MLX | `8001` |\n| vllm-mlx | `8000` |\n| mlx-lm | `8080` |\n| Ollama | `11434` |\n\noMLX and vllm-mlx both default to port `8000`. To avoid mislabeling results,\n`mlx-chronos` checks the oMLX listener process with `lsof`; if that process cannot\nbe inspected, oMLX validation may fail even when `/v1/models` responds.\n\n`mlx-chronos run` executes a fixed protocol against the running engine. 
The JSON\nresult records exact prompt text, token bounds, benchmark profile, timing\nmetadata, hardware metadata, and an integrity seal.\n\n| Phase | What happens |\n|---|---|\n| Hardware detection | Captures chip, machine model, memory, macOS, Python, architecture, battery state, Low Power Mode, and thermal context when available |\n| Warmup | Uses a separate prompt so same-run prefix/KV cache hits do not remove throughput prefill work |\n| Cold TTFT | Uses unique prompts inside the run to avoid same-run cache hits |\n| Cached TTFT | Primes one fixed prompt, then measures consecutive cached trials |\n| Throughput | Uses fixed protocol prompts and deterministic generation parameters |\n| RAM and thermal tracking | Samples system RAM, diagnostic engine RSS, phase timings, and thermal state where available |\n| Result sealing | Adds a tamper-evident integrity seal for public-submission validation |\n\n- Requests use deterministic generation parameters: `temperature=0.0` and `top_p=1.0`.\n- Throughput is end-to-end request throughput, not pure decode speed. It includes request overhead, prefill, and decode.\n- Timed TTFT and throughput requests are never retried. A transient request failure invalidates the run instead of becoming part of a published timing.\n- Cached TTFT is recorded only after cache priming completes successfully.\n- Decode throughput records first-content-to-stream-end elapsed time so the value can be reconstructed from raw completion-token counts.\n- Throughput prompts intentionally vary to reduce cache artifacts, so run standard deviation includes workload variation plus system and engine noise.\n- If an engine cannot provide reliable `usage.completion_tokens`, the run falls back to a local estimate and is marked as not leaderboard-comparable.\n- p95 is reported only when at least 20 trials are available.\n- The default baseline run uses 5 trials. 
The maximum prompt pool supports 30 unique cold and throughput prompts.\n\n`--profile sustained` runs one long throughput trial with `max_tokens=1000` by\ndefault and records progress samples every 100 generated output units.\nIntermediate samples are estimates when the stream only reports exact token\nusage at the end.\n\nIf the sustained run observes a thermal-state change or non-nominal thermal state, result metadata includes a sustained throttling warning. The warning compares early and late progress-window averages, not a single first/last sample.\n\nBefore each run, `mlx-chronos` checks the latest prior JSON result in the same\noutput directory. The elapsed time is saved as\n`meta.elapsed_since_last_benchmark_seconds`.\n\nUse `--cooldown-seconds` to enforce a pause before starting another run. The\ndefault recent-run warning threshold is 300 seconds.\n\nFor a fuller explanation, see\n[docs/methodology.md](https://github.com/igurss/mlx-chronos/blob/main/docs/methodology.md).\n\nLocal runs are intentionally flexible. You can change trial count, profile, output token bounds, cooldown, connection mode, notes, and other parameters for your own diagnostics.\n\nPublic leaderboard submissions are stricter so rows remain comparable.\n\n| Profile | Trials | `max_tokens` | Minimum generated output | `min_tokens` |\n|---|---|---|---|---|\n| Baseline | 5 | 100 | 80 tokens | Not allowed |\n| Sustained | 1 | 1000 | 800 tokens | Not allowed |\n\n- Throughput must use the engine response's `usage.completion_tokens`.\n- The result must include `model.reference_url`, a link to the model used.\n- The inference engine version must be known; `engine.version=unknown` is not accepted for public comparison.\n- Hardware must report an Apple M-series chip, `arm64`, and a valid macOS version; timestamps may not be more than 10 minutes in the future.\n- All warmup calls must complete successfully (`warmup_failures=0`).\n
- System RAM, engine RSS, and continuous Foundation thermal monitoring must complete without sampling errors.\n- macOS Low Power Mode must be disabled.\n- Decode throughput must include reconstructible raw decode elapsed time.\n- The JSON must pass `mlx-chronos submit --dry-run`.\n- The result must include a valid integrity seal.\n- The archive rejects duplicate integrity digests and duplicate run identities.\n- Custom token bounds, fallback token estimates, custom public-profile trial counts, short-output runs, and Low Power Mode runs are valid local records but are not accepted into the public leaderboard.\n\nResult JSON also contains internal benchmark-protocol labels used by validators\nto detect incompatible result formats. Treat labels such as `1`, `2`, and `3`\nas implementation compatibility markers, not public protocol release versions.\nModel reference URLs point to the model page used for the run. Model pages can\nchange over time when maintainers update files or tags.\nLeaderboard comparisons keep model name, quantization, format, and model\nreference URL separate so distinct variants are not grouped together.\n\n- Run `mlx-chronos run` on your Mac.\n- Find the generated JSON in `results/local/`.\n- Validate it locally:\n\n```\nmlx-chronos submit --file results/local/your-result.json --dry-run\n```\n\n- Copy the checked JSON into `results/submitted/` with a clear filename.\n- Open a pull request with only that JSON file changed.\n- GitHub Actions labels the PR as `result-submission`, validates schema and integrity, and the maintainer reviews it before merge.\n\nWarning: Do not edit submitted JSON by hand after the run. 
Public submissions include an `integrity` seal over the canonical result payload; changing any benchmark field invalidates that seal.\n\nIf opening a PR is inconvenient, send a validated result directly:\n\n```\nmlx-chronos submit --file results/local/your-result.json\n```\n\nMaintainers can override the inbox endpoint with `--endpoint` or\n`MLX_CHRONOS_SUBMIT_ENDPOINT`.\n\nSee [CONTRIBUTING.md](https://github.com/igurss/mlx-chronos/blob/main/CONTRIBUTING.md)\nfor detailed contributor instructions.\n\n- Core benchmark runner with repeated trials, warmup, cache priming, and phase-separated metrics\n- Engine support for oMLX, Rapid-MLX, vllm-mlx, mlx-lm, and Ollama\n- Hardware detection for chip, machine model, memory, macOS, Python, architecture, and thermal state\n- Strict JSON schema validation with raw-trial consistency checks\n- Continuous system RAM peak sampling, with post-warmup engine RSS kept as a diagnostic field\n- Preflight validation for engine, server, and model access\n- GitHub Actions validation for submitted results\n- PR-based result submissions with automatic `result-submission`, `code`, and `documentation` labels\n- GitHub Pages leaderboard with model/chip/RAM engine comparison and configurable raw-data columns\n- JSON and Markdown result export\n- `mlx-chronos submit` for sending validated JSON results to the maintainer inbox\n- Warnings for battery mode, Low Power Mode, non-nominal thermal state, and unavailable thermal state\n- Integration tests against mock OpenAI-compatible servers\n- Larger fixed cold-prompt pool with optional p95 reporting for larger runs\n- Request-throughput timing metadata and client-observed streaming decode throughput\n- Phase timing metadata and lightweight continuous thermal monitoring\n- Sustained benchmark profile, cooldown metadata, and strict local-vs-public leaderboard policy\n- Public submission trust model with lightweight anti-spoofing checks\n- External contributor workflow for code PRs 
and leaderboard result submissions\n- CLI update notifications and `mlx-chronos upgrade`\n\n- Evaluate a clearer TTFT naming model without breaking the v0.1 JSON contract\n- Add tool-calling success-rate benchmarks\n- Collect more results from M3, M4, and M5 systems\n\nApache 2.0. See [LICENSE](https://github.com/igurss/mlx-chronos/blob/main/LICENSE).", "url": "https://wpnews.pro/news/show-hn-mlx-chronos-benchmark-mlx-inference-engines-on-apple-silicon", "canonical_source": "https://github.com/igurss/mlx-chronos", "published_at": "2026-06-25 16:32:19+00:00", "updated_at": "2026-06-25 16:44:57.787911+00:00", "lang": "en", "topics": ["machine-learning", "developer-tools", "ai-infrastructure"], "entities": ["mlx-chronos", "Apple Silicon", "Ollama", "mlx-lm", "Rapid-MLX", "vllm-mlx", "omlx"], "alternates": {"html": "https://wpnews.pro/news/show-hn-mlx-chronos-benchmark-mlx-inference-engines-on-apple-silicon", "markdown": "https://wpnews.pro/news/show-hn-mlx-chronos-benchmark-mlx-inference-engines-on-apple-silicon.md", "text": "https://wpnews.pro/news/show-hn-mlx-chronos-benchmark-mlx-inference-engines-on-apple-silicon.txt", "jsonld": "https://wpnews.pro/news/show-hn-mlx-chronos-benchmark-mlx-inference-engines-on-apple-silicon.jsonld"}}