{"slug": "show-hn-kitchen-rush-overcooked-inspired-llm-tool-calling-benchmark", "title": "Show HN: Kitchen Rush, Overcooked inspired LLM tool calling benchmark", "summary": "Kitchen Rush, a new benchmark for evaluating large language model tool-calling, measures both accuracy and latency by simulating an Overcooked-style kitchen where thinking time directly impacts game performance. The benchmark produces a single score, KR, that combines speed and correctness, with separate leaderboards for different latency budgets.", "body_md": "**An agent tool-calling benchmark where latency matters as much as intelligence.**\n\nMost tool-calling benchmarks (BFCL, τ-bench, ToolSandbox, AppWorld) check *whether* a model\nmakes the right calls — and the world politely waits while it thinks. That's fine for offline\ntasks. But if you're building a voice assistant, a live-ops agent, or anything realtime, you\ncare about two things at once: **does the model do the right thing, and does it do it fast\nenough?** A model that finds the perfect answer after thirty seconds of reasoning is, for you,\nthe wrong model.\n\nKitchen Rush measures both at once, by construction: the time a model spends thinking is\nconverted into game time that passes *before* its actions land. While the model deliberates,\nfood keeps cooking, food burns, and order deadlines slip away. Speed and accuracy aren't two\ncharts you squint at — they're one score, experienced the way a deployment would experience\nthem.\n\nThe model plays a chef in an [Overcooked](https://github.com/HumanCompatibleAI/overcooked_ai)-style\nkitchen. Orders stream in (burgers, soups, ramen…), and the model fulfils them with ordinary\n**native function calls** — `collect`, `chop`, `cook`, `plate`, `serve` — racing deadlines,\nburn timers, and a combo bonus for consecutive successful dishes. 
Three deliberate changes from\nOvercooked:\n\n- **Latency is the game.** Every model response first charges its thinking time to the shared world clock, then its actions execute. (You can chain several calls in one response and pay the latency once — decisiveness is rewarded.)\n- **No joystick skills.** The chef walks itself to the right station automatically; travel time is charged inside the action. What's being tested is *choosing the right action sequence under time pressure*, not video-game reflexes.\n- **Fully deterministic.** Same seed, same actions, same latencies → exactly the same episode, every time, on any machine. Every run can be replayed in a browser viewer and audited.\n\nEvery episode produces a single 0–100 score we call **KR** (the **Kitchen Rush score**). It's\ngraded on a curve between two fixed anchors: KR 0 means \"no better than doing nothing and\nletting every order expire,\" and KR 100 means \"matched a scripted reference chef that plays\nthe same kitchen with zero latency.\"\n\nA worked example makes it concrete. Say that on one kitchen the do-nothing chef finishes at\n**−60** points (every order expired), the zero-latency reference chef finishes at **+140**,\nand your model finishes at **+40**. There are 200 points between the two anchors and your\nmodel covered 100 of them, so its KR is **50** — it closed half the gap to the reference.\nAverage that over many seeded kitchens and you have the leaderboard number\n([docs/METHODOLOGY.md](/bassimeledath/kitchen-rush/blob/main/docs/METHODOLOGY.md) has the full formula).\n\nHere's the knob that makes Kitchen Rush flexible: every kitchen is generated **at a latency\nbudget B** (`--latency-budget`, in seconds per decision). Think of B as **the pace the kitchen is priced for**: order deadlines are set so that a chef spending exactly B seconds on each decision can finish every order, with roughly 1.4–1.6× headroom to spare. 
Each B gets its own leaderboard — results at different budgets are never averaged together.\n\nFor the mathematically inclined, the pricing is exact:\n\n```\ndeadline = arrival + ⌈σ · C(B)⌉,   where C(B) = A + K·B\n```\n\n`A` is the order's intrinsic cooking/walking time, `K` is how many decisions a competent plan\nneeds, and σ is the headroom (1.4–1.6 by tier). So a model that actually decides in ℓ seconds\ngains or loses `K·(B − ℓ)` seconds of breathing room per order. Faster than B? You bank slack\nand serve while orders are still worth full value. Slower? You eat through the headroom, and\norders start becoming unfinishable at around `ℓ ≈ B + (σ−1)·C(B)/K` — about 3–4 s/decision at\nB=1 on the current tiers, which is exactly where our calibration sweep shows the reference\nchef collapsing ([docs/METHODOLOGY.md §2](/bassimeledath/kitchen-rush/blob/main/docs/METHODOLOGY.md),\n[docs/CALIBRATION.md](/bassimeledath/kitchen-rush/blob/main/docs/CALIBRATION.md)).\n\nAnd in plain deployment terms: **the model that wins at B=1s is the best pick when every\ndecision has to land in about a second** — on the benchmark's reproducible clock that's a\nbudget of roughly 65 output tokens per decision, i.e. terse, single-shot tool dispatch — what a\nvoice agent needs. **B=5s** buys about 730 tokens per decision — enough for a short burst of\nreasoning, what an interactive assistant can afford. The same model can rank very differently on the\ntwo boards, and that reordering is precisely what the benchmark is for.\n\n17 model configurations × 12 seeds × {medium, hard} kitchens × two latency budgets — 816\nepisodes so far. Each chart is one latency budget; bars are mean KR, whiskers are 95%\nconfidence intervals. 
The full per-tier table (with costs, reasoning tokens, and serve rates)\nis at [leaderboard/results/board.md](/bassimeledath/kitchen-rush/blob/main/leaderboard/results/board.md).\n\n**The left board (B=1s)** is the realtime test: the kitchen is priced for one second per\ndecision, which on the benchmark's clock buys about 65 output tokens — terse, single-shot tool\ndispatch. Winning here means \"the model I'd trust to drive a voice agent or a live dashboard.\"\n**The right board (B=5s)** prices the same kitchens for five seconds per decision (~730\ntokens — room for a short burst of reasoning), what an interactive assistant can afford.\n\nRead them side by side — that contrast is the product. Under tight realtime pressure (B=1s)\nthe fast no-reasoning models hold the podium: `gemini-3.1-flash-lite` runs nearly even with\n`claude-sonnet-4.6` (32 vs 37). Give every decision five seconds instead and the board\nreorders: `gpt-5.4-mini` with low reasoning rockets from near-zero to a **dead heat with\nsonnet (44 vs 44) at about a fifth of the cost**, while flash-lite *drops* to half its B=1\nstanding. The same mini with reasoning fully off scores 0.0 at both budgets — reasoning it\ncan't afford at B=1 is exactly what makes it a frontier-level tool caller at B=5. That's the\nlatency tax, made visible. (`·think` rows ran with reasoning on at low effort; everything\nelse with reasoning off — fast single-shot dispatch is the honest realtime default. One row\nyou might expect is missing: there is no `claude-sonnet-4.6·think`, because Anthropic's API\ndoes not allow extended thinking when tool calls are forced, and the harness forces tool\ncalls — sonnet competes thinking-off only.)\n\n*The flip, watched live: the same two models from the clip at the top,\nbut in a kitchen priced at B=5s. Now the mini's reasoning burst is affordable — it finishes\nevery order at 99 raw points (KR 86) while sonnet is still cooking at 40. 
This is the\nmini's best kitchen — the chart above shows the average, a 44–44 tie across all 24 — but the\ndirection is real: it wins the medium tier at B=5 outright (59 vs 52). Same models, different\nlatency budget, different winner: that's exactly what the two boards measure.*\n\nTwo minutes — run the scripted reference chef locally (no model calls):\n\n```\npip install -e .                          # the core has zero dependencies\nkitchenrush bench --baseline random --tier easy --seeds 12 --trials 2\nkitchenrush calibrate --tier easy --latency-budget 1   # see how the reference chef degrades with latency\n\n# watch a game in the browser (scripted chef):\nkitchenrush replay --oracle --tier easy --seed 0       # writes ui/replays/easy_seed0.json\ncd ui && python3 -m http.server 8000                   # then open http://localhost:8000\n# ...or race up to 4 models side-by-side on one clock: ?replays=a.json,b.json (see ui/README.md)\n```\n\nTo benchmark a real model, add provider support and your API key:\n\n```\npip install -e '.[providers]'\nkitchenrush bench --model anthropic:claude-sonnet-4-6 --tier medium --latency-budget 1\n```\n\nAny LiteLLM-routable model works via `provider:model`. You can also plug in a fully custom\nclient — it only needs a `name` and a `generate(system, messages, tools) -> ModelResponse`\nmethod, registered with `register_adapter`. 
CLI commands: `run`, `bench`, `replay`, `seeds`, `calibrate`.\n\n- [docs/RULES.md](/bassimeledath/kitchen-rush/blob/main/docs/RULES.md) — the authoritative, code-verified ruleset\n- [docs/METHODOLOGY.md](/bassimeledath/kitchen-rush/blob/main/docs/METHODOLOGY.md) — the KR metric, the math of B, statistical protocol\n- [docs/CALIBRATION.md](/bassimeledath/kitchen-rush/blob/main/docs/CALIBRATION.md) — the evidence behind the gen-1.0 freeze\n- [docs/LIMITATIONS.md](/bassimeledath/kitchen-rush/blob/main/docs/LIMITATIONS.md) — what KR does and doesn't measure (worth reading before citing results)\n- [docs/OBJECTIONS.md](/bassimeledath/kitchen-rush/blob/main/docs/OBJECTIONS.md) — anticipated critiques, answered with data\n- [docs/SUBMISSIONS.md](/bassimeledath/kitchen-rush/blob/main/docs/SUBMISSIONS.md) · [docs/CONTAMINATION.md](/bassimeledath/kitchen-rush/blob/main/docs/CONTAMINATION.md) — leaderboard contract & data hygiene\n\nIf you use Kitchen Rush in your work, please cite it (machine-readable copy in\n[CITATION.cff](/bassimeledath/kitchen-rush/blob/main/CITATION.cff)):\n\n```\n@software{kitchenrush2026,\n  author = {Eledath, Bassim},\n  title  = {Kitchen Rush: A Benchmark for Accurate and Fast Tool Calling},\n  url    = {https://github.com/bassimeledath/kitchen-rush},\n  year   = {2026}\n}\n```\n\nApache-2.0. 
See [LICENSE](/bassimeledath/kitchen-rush/blob/main/LICENSE).", "url": "https://wpnews.pro/news/show-hn-kitchen-rush-overcooked-inspired-llm-tool-calling-benchmark", "canonical_source": "https://github.com/bassimeledath/kitchen-rush", "published_at": "2026-06-16 07:10:39+00:00", "updated_at": "2026-06-16 07:18:42.844777+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "ai-research", "ai-tools", "ai-infrastructure"], "entities": ["Kitchen Rush", "Overcooked", "BFCL", "τ-bench", "ToolSandbox", "AppWorld"], "alternates": {"html": "https://wpnews.pro/news/show-hn-kitchen-rush-overcooked-inspired-llm-tool-calling-benchmark", "markdown": "https://wpnews.pro/news/show-hn-kitchen-rush-overcooked-inspired-llm-tool-calling-benchmark.md", "text": "https://wpnews.pro/news/show-hn-kitchen-rush-overcooked-inspired-llm-tool-calling-benchmark.txt", "jsonld": "https://wpnews.pro/news/show-hn-kitchen-rush-overcooked-inspired-llm-tool-calling-benchmark.jsonld"}}