David Just Beat Goliath on Terminal-Bench 2.1

Backboard R-CLI, a small open-source terminal agent, achieved the #1 published score on Terminal-Bench 2.1 with 84.3% accuracy (75/89 tasks), beating larger competitors like Codex CLI and Claude Code. The agent runs the same off-the-shelf Claude Opus 4.8 model that is available to everyone, so its performance is attributed to disciplined engineering and system design rather than model size. All results, including logs and verifier reports, are publicly available on GitHub.

The story goes that Goliath showed up in armor, backed by an army, and everyone assumed the fight was already decided. Then a shepherd with a sling walked out instead, and the size of the opponent stopped being the thing that mattered. That's basically what happened on Terminal-Bench 2.1 this week.

Terminal-Bench 2.1 is where the frontier labs go to flex. It's 89 hard, realistic terminal tasks: compiling code, debugging broken builds, configuring servers, recovering corrupted data, training models. No multiple choice, no partial credit. An independent verifier checks the final state, and either it passes or it doesn't.

The names at the top of that leaderboard are the ones you'd expect: Codex CLI and Claude Code, backed by teams with nine-figure compute budgets and the biggest models money can rent. This is Goliath's turf.

This week, Backboard R-CLI, a small, open-source terminal agent, walked in with the same off-the-shelf model everyone else has access to (Claude Opus 4.8, via Bedrock) and put up the #1 published score: 75 of 89 tasks solved, 84.3% accuracy.

| Rank | Agent | Model | Accuracy |
|---|---|---|---|
| 1 | Backboard R-CLI | Claude Opus 4.8 | 84.3% |
| 2 | Codex CLI | GPT-5.5 | 83.4% |
| 3 | Claude Code | Claude 5 Fable | 83.1% |
| 4 | Terminus 2 | Claude 5 Fable | 80.4% |
| 5 | Claude Code | Claude Opus 4.8 | 78.9% |

Same model, different harness. R-CLI beat the next-best Opus 4.8 result by 5.4 points. That gap isn't the model talking. It's the system around it: the sling, not the size of the fighter.

No bigger sword here, just better mechanics. None of that is exotic. It's disciplined engineering applied consistently, the kind of thing that's easy to skip when you're racing to ship a wrapper around the newest model.

The part we think matters most: everything is public. Task-level verifier reports, run configs, pass/fail outcomes, full logs.
The whole thing is [on GitHub](https://github.com/Backboard-io/Backboard-R-CLI-Terminal-Bench-2.1-Results) for anyone to inspect or try to break. A benchmark number without the evidence behind it is just a claim. This one isn't. David brought a stone that anyone could pick up and examine.

🔗 https://backboard.io/blog/backboard-r-cli-is-now-1-on-terminal-bench-2.1

Go poke holes in it. That's the point.
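
If you want somewhere to start, check the arithmetic. The Python sketch below recomputes the headline accuracy from the solved and total task counts, then the gap to the next Opus 4.8 entry, using only the leaderboard figures quoted above. The function names `accuracy_pct` and `gap_points` are made up for illustration; they are not part of R-CLI or the Terminal-Bench tooling, and the one-decimal rounding is our assumption about how the leaderboard reports scores.

```python
def accuracy_pct(solved: int, total: int) -> float:
    """Binary scoring: every task either passes or fails, no partial credit."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= solved <= total:
        raise ValueError("solved must be between 0 and total")
    # Assumption: scores are reported as a percentage with one decimal place.
    return round(100 * solved / total, 1)


def gap_points(score_a: float, score_b: float) -> float:
    """Difference between two accuracy percentages, in percentage points."""
    return round(score_a - score_b, 1)


# Figures taken from the leaderboard above.
r_cli = accuracy_pct(75, 89)               # Backboard R-CLI, Claude Opus 4.8
claude_code_opus = 78.9                    # Claude Code, Claude Opus 4.8

print(f"R-CLI accuracy: {r_cli}%")
print(f"Gap to next Opus 4.8 entry: {gap_points(r_cli, claude_code_opus)} points")
```

Run against the published pass/fail outcomes instead of the hard-coded counts, the same calculation is a quick sanity check that the reported score matches the per-task results.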