Show HN: Kitchen Rush, an Overcooked-inspired LLM tool-calling benchmark

Kitchen Rush, a new benchmark for evaluating large language model tool-calling, measures both accuracy and latency by simulating an Overcooked-style kitchen where thinking time directly impacts game performance. The benchmark produces a single score, KR, that combines speed and correctness, with separate leaderboards for different latency budgets.

An agent tool-calling benchmark where latency matters as much as intelligence.

Most tool-calling benchmarks (BFCL, τ-bench, ToolSandbox, AppWorld) check whether a model makes the right calls, and the world politely waits while it thinks. That's fine for offline tasks. But if you're building a voice assistant, a live-ops agent, or anything realtime, you care about two things at once: does the model do the right thing, and does it do it fast enough? A model that finds the perfect answer after thirty seconds of reasoning is, for you, the wrong model.

Kitchen Rush measures both at once, by construction: the time a model spends thinking is converted into game time that passes before its actions land. While the model deliberates, food keeps cooking, food burns, and order deadlines slip away. Speed and accuracy aren't two charts you squint at; they're one score, experienced the way a deployment would experience them.

The model plays a chef in an Overcooked-style kitchen (https://github.com/HumanCompatibleAI/overcooked_ai). Orders stream in (burgers, soups, ramen…), and the model fulfils them with ordinary native function calls (collect, chop, cook, plate, serve), racing deadlines and burn timers and chasing a combo bonus for consecutive successful dishes.

Three deliberate changes from Overcooked:

Latency is the game. Every model response first charges its thinking time to the shared world clock, then its actions execute. You can chain several calls in one response and pay the latency once, so decisiveness is rewarded.

No joystick skills. The chef walks itself to the right station automatically; travel time is charged inside the action. What's being tested is choosing the right action sequence under time pressure, not video-game reflexes.

Fully deterministic. Same seed, same actions, same latencies → exactly the same episode, every time, on any machine. Every run can be replayed in a browser viewer and audited.
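The "latency is the game" rule above can be sketched in a few lines. This is a toy model, not the benchmark's engine: the `Response` class, the `play` function, and all durations are hypothetical names and numbers chosen for illustration. It only shows the ordering described in the text (thinking time is charged first, then actions run) and why chaining calls in one response pays the latency once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """One model turn (hypothetical shape): thinking time plus the actions returned."""
    latency_s: float
    action_durations_s: tuple  # seconds per action, travel time included


def play(responses):
    """Advance a toy world clock; return (final clock, time each action landed)."""
    clock = 0.0
    landed = []
    for r in responses:
        clock += r.latency_s            # thinking time is charged to the clock first
        for d in r.action_durations_s:
            clock += d                  # then each action executes in order
            landed.append(clock)
    return clock, landed


# The same three actions at 2 s of thinking per response (made-up numbers).
chained = [Response(2.0, (1.0, 3.0, 1.0))]
one_at_a_time = [
    Response(2.0, (1.0,)),
    Response(2.0, (3.0,)),
    Response(2.0, (1.0,)),
]

print(play(chained))        # (7.0, [3.0, 6.0, 7.0])
print(play(one_at_a_time))  # (11.0, [3.0, 8.0, 11.0])
```

In this sketch the chained response finishes four seconds earlier because it pays the 2 s latency once instead of three times. Because the function depends only on its inputs, replaying the same latencies and actions always gives the same timeline, which is the property the determinism rule relies on.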
Every episode produces a single 0–100 score we call KR (the Kitchen Rush score). It's graded on a curve between two fixed anchors: KR 0 means "no better than doing nothing and letting every order expire," and KR 100 means "matched a scripted reference chef that plays the same kitchen with zero latency."

A worked example makes it concrete. Say that on one kitchen the do-nothing chef finishes at −60 points (every order expired), the zero-latency reference chef finishes at +140, and your model finishes at +40. There are 200 points between the two anchors and your model covered 100 of them, so its KR is 50: it closed half the gap to the reference. Average that over many seeded kitchens and you have the leaderboard number (docs/METHODOLOGY.md has the full formula).

Here's the knob that makes Kitchen Rush flexible: every kitchen is generated at a latency budget B (--latency-budget, in seconds per decision). Think of B as the pace the kitchen is priced for: order deadlines are set so that a chef spending exactly B seconds on each decision can finish every order, with roughly 1.4–1.6× headroom to spare. Each B gets its own leaderboard; results at different budgets are never averaged together.

For the mathematically inclined, the pricing is exact: deadline = arrival + ⌈σ · C_B⌉, where C_B = A + K·B (A is the order's intrinsic cooking/walking time, K is how many decisions a competent plan needs, and σ is the headroom, 1.4–1.6 by tier). So a model that actually decides in ℓ seconds gains or loses K·(B − ℓ) seconds of breathing room per order. Faster than B? You bank slack and serve while orders are still worth full value. Slower?
You eat through the headroom, and orders start becoming unfinishable at around ℓ ≈ B + (σ − 1)·C_B/K, about 3–4 s/decision at B=1 on the current tiers, which is exactly where our calibration sweep shows the reference chef collapsing (docs/METHODOLOGY.md §2, docs/CALIBRATION.md).

And in plain deployment terms: the model that wins at B=1s is the best pick when every decision has to land in about a second. On the benchmark's reproducible clock that's a budget of roughly 65 output tokens per decision, i.e. terse, single-shot tool dispatch, which is what a voice agent needs. B=5s buys about 730 tokens per decision: enough for a short burst of reasoning, what an interactive assistant can afford. The same model can rank very differently on the two boards, and that reordering is precisely what the benchmark is for.

17 model configurations × 12 seeds × {medium, hard} kitchens × two latency budgets: 816 episodes so far. Each chart is one latency budget; bars are mean KR, whiskers are 95% confidence intervals. The full per-tier table (with costs, reasoning tokens, and serve rates) is at leaderboard/results/board.md.

The left board (B=1s) is the realtime test: the kitchen is priced for one second per decision, which on the benchmark's clock buys about 65 output tokens, i.e. terse, single-shot tool dispatch. Winning here means "the model I'd trust to drive a voice agent or a live dashboard." The right board (B=5s) prices the same kitchens for five seconds per decision (~730 tokens, room for a short burst of reasoning), what an interactive assistant can afford. Read them side by side: that contrast is the product.

Under tight realtime pressure (B=1s) the fast no-reasoning models hold the podium: gemini-3.1-flash-lite runs nearly even with claude-sonnet-4.6 (32 vs 37).
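For readers who want to check the arithmetic, the KR worked example and the deadline pricing described above can be sketched as follows. The function names are hypothetical, and the order parameters (A = 20 s, K = 5 decisions, σ = 1.5) are made-up values for illustration, not taken from any real tier. The sketch covers a single kitchen only; any clamping or averaging across seeds is defined in docs/METHODOLOGY.md and is not reproduced here.

```python
import math


def kr_score(model_points, idle_points, reference_points):
    """Place a raw score on the 0-100 curve between the two anchors."""
    return 100.0 * (model_points - idle_points) / (reference_points - idle_points)


def deadline(arrival, A, K, B, sigma):
    """deadline = arrival + ceil(sigma * C_B), with C_B = A + K * B."""
    return arrival + math.ceil(sigma * (A + K * B))


def slack_change(K, B, latency):
    """Breathing room gained (+) or lost (-) per order: K * (B - latency)."""
    return K * (B - latency)


def collapse_latency(A, K, B, sigma):
    """Latency where the headroom runs out: B + (sigma - 1) * C_B / K."""
    return B + (sigma - 1.0) * (A + K * B) / K


# Worked example from the text: idle chef -60, reference +140, model +40.
print(kr_score(40, -60, 140))                 # 50.0

# Hypothetical order priced at B = 1 s (A, K and sigma are assumptions).
print(deadline(0, A=20, K=5, B=1, sigma=1.5))         # 38
print(slack_change(K=5, B=1, latency=3.0))            # -10.0
print(collapse_latency(A=20, K=5, B=1, sigma=1.5))    # 3.5
```

With these assumed parameters the collapse point comes out at 3.5 s per decision, which sits inside the 3–4 s range quoted for B=1, but that agreement depends entirely on the illustrative values chosen.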
Give every decision five seconds instead and the board reorders: gpt-5.4-mini with low reasoning rockets from near-zero to a dead heat with sonnet (44 vs 44, at about a fifth of the cost), while flash-lite drops to half its B=1 standing. The same mini with reasoning fully off scores 0.0 at both budgets: the reasoning it can't afford at B=1 is exactly what makes it a frontier-level tool caller at B=5. That's the latency tax, made visible.

Rows marked ·think ran with reasoning on at low effort; everything else ran with reasoning off, since fast single-shot dispatch is the honest realtime default. One row you might expect is missing: there is no claude-sonnet-4.6·think, because Anthropic's API does not allow extended thinking when tool calls are forced, and the harness forces tool calls, so sonnet competes thinking-off only.

The flip, watched live: the same two models from the clip at the top, but in a kitchen priced at B=5s. Now the mini's reasoning burst is affordable: it finishes every order at 99 raw points (KR 86) while sonnet is still cooking at 40. This is the mini's best kitchen (the chart above shows the average, a 44–44 tie across all 24), but the direction is real: it wins the medium tier at B=5 outright (59 vs 52). Same models, different latency budget, different winner: that's exactly what the two boards measure.

Two minutes: run the scripted reference chef locally (no model calls):

    pip install -e .    # the core has zero dependencies
    kitchenrush bench --baseline random --tier easy --seeds 12 --trials 2
    kitchenrush calibrate --tier easy --latency-budget 1    # see how the reference chef degrades with latency

    # watch a game in the browser (scripted chef):
    kitchenrush replay --oracle --tier easy --seed 0    # writes ui/replays/easy_seed0.json
    cd ui && python3 -m http.server 8000    # then open http://localhost:8000
    # ...or race up to 4 models side-by-side on one clock: ?replays=a.json,b.json (see ui/README.md)

To benchmark a real model, add provider support and your API key:

    pip install -e '.[providers]'
    kitchenrush bench --model anthropic:claude-sonnet-4-6 --tier medium --latency-budget 1

Any LiteLLM-routable model works via provider:model. You can also plug in a fully custom client: it only needs a name and a generate(system, messages, tools) -> ModelResponse method, registered with register_adapter.

CLI commands: run, bench, replay, seeds, calibrate.

Documentation (paths are relative to https://github.com/bassimeledath/kitchen-rush/blob/main/):

docs/RULES.md: the authoritative, code-verified ruleset
docs/METHODOLOGY.md: the KR metric, the math of B, statistical protocol
docs/CALIBRATION.md: the evidence behind the gen-1.0 freeze
docs/LIMITATIONS.md: what KR does and doesn't measure (worth reading before citing results)
docs/OBJECTIONS.md: anticipated critiques, answered with data
docs/SUBMISSIONS.md · docs/CONTAMINATION.md: leaderboard contract & data hygiene

If you use Kitchen Rush in your work, please cite it (machine-readable copy in CITATION.cff):

    @software{kitchenrush2026,
      author = {Eledath, Bassim},
      title  = {Kitchen Rush: A Benchmark for Accurate and Fast Tool Calling},
      url    = {https://github.com/bassimeledath/kitchen-rush},
      year   = {2026}
    }

Apache-2.0. See LICENSE.