{"slug": "atlarix-vs-opencode-on-terminal-bench-2-0-same-model-only-the-harness-changes-k", "title": "Atlarix vs opencode on Terminal-Bench 2.0 — same model, only the harness changes (k=1, receipts included)", "summary": "Atlarix, an agent workstation for open-weight models, resolved 42 out of 89 tasks on Terminal-Bench 2.0 compared to opencode's 39 out of 89, using the same model and identical infrastructure. The 3-task difference falls within k=1 noise, indicating that the harness does not bottleneck model performance.", "body_md": "I build [Atlarix](https://atlarix.dev), an agent workstation for open-weight models. The core claim behind it is that the harness — retrieval, tool surface, control loop — is what lets an open-weight model perform, not just the model's raw weights. This post is me trying to falsify that claim with a controlled run, and publishing every output file so you can check it.\n\nShort version: on Terminal-Bench 2.0, single attempt, **Atlarix resolved 42/89 and opencode resolved 39/89** on the same model. That 3-task gap is **within k=1 noise** — I'm not claiming a win. What it shows is that the harness isn't bottlenecking the model. Details and caveats below; raw files at the end.\n\nThe only variable is the harness. 
Everything else is pinned identical across both agents.\n\n- `terminal-bench/terminal-bench-2`: all 89 tasks, one isolated container each, automated verifiers.\n- `minimax/minimax-m3`, routed through OpenRouter, pinned to a single provider.\n- `-e modal`: one container per task.\n- `-k 1`: one attempt per task.\n- `--timeout-multiplier 1` (same for both).\n- `--max-retries 3` (same for both).\n\n```\n# Atlarix harness\nharbor run -d terminal-bench/terminal-bench-2 \\\n  -m openai/minimax/minimax-m3 \\\n  -n 24 -k 1 -y --timeout-multiplier 1 --max-retries 3 \\\n  -e modal --agent-import-path atlarix_tb:AtlarixAgent\n\n# opencode harness (same model + provider + infra)\nharbor run -d terminal-bench/terminal-bench-2 \\\n  -m bench/minimax/minimax-m3 \\\n  -n 24 -k 1 -y --timeout-multiplier 1 --max-retries 3 \\\n  -e modal --agent-import-path atlarix_tb.opencode_proxy:BenchOpenCodeAgent\n```\n\n(`-n 24` is concurrency — how many containers run in parallel — not a task count. All 89 tasks run.)\n\n| Harness | Resolved | Score |\n|---|---|---|\n| Atlarix | 42 / 89 | 47% |\n| opencode | 39 / 89 | 44% |\n\n**k=1 means one sample per task.** The official Terminal-Bench leaderboard requires **k=5** specifically to measure run-to-run variance. A 3-task difference at k=1 is inside that noise band. So this is **not** a leaderboard result and not a claim that Atlarix beats opencode. The honest takeaway: an open-weight model performs about as well under Atlarix as under a strong existing harness — the harness isn't holding it back.\n\n**~25% of tasks timed out — for both harnesses.** At native timeout (×1), roughly a quarter of tasks hit `AgentTimeoutError` on each side and count as unresolved. So the sub-50% absolute scores aren't all capability failures; a meaningful share are wall-clock timeouts on heavy tasks. 
The timeout ceiling is identical for both agents, so the comparison stays fair — but that's why neither number is higher.\n\nAtlarix's desktop app asks for human approval before every file write and command — a core safety feature. Benchmarks run unattended, so I grant that approval once via an explicit operator flag (`ATLARIX_AUTONOMOUS_DANGER=1`). Without it, any task needing an install or privileged command is blocked and fails.\n\nThis is **not** an advantage over opencode — every agent auto-approves to run an automated benchmark; it's inherent to running unattended. Stating it for full transparency. The flag is off by default; the interactive app always asks.\n\nThe exact Atlarix bundle I ran is a public, Electron-free headless build: `atlarix-headless-linux-amd64.tar.gz`. The benchmark is the open-source Harbor framework. The raw Harbor result files — per-task pass/fail for both harnesses — are published unedited. Nothing is hand-typed.\n\nEverything (raw `result.json` for both sides, `summary.csv`, exact bundle, full setup): [atlarix.dev/benchmark](https://atlarix.dev/benchmark)\n\nIf you spot something wrong in the result files, that's the point — tell me.\n\n*Built in Nairobi.*", "url": "https://wpnews.pro/news/atlarix-vs-opencode-on-terminal-bench-2-0-same-model-only-the-harness-changes-k", "canonical_source": "https://dev.to/amariahak/atlarix-vs-opencode-on-terminal-bench-20-same-model-only-the-harness-changes-k1-receipts-54nk", "published_at": "2026-06-29 19:22:30+00:00", "updated_at": "2026-06-29 19:48:39.061810+00:00", "lang": "en", "topics": ["ai-agents", "developer-tools", "large-language-models", "ai-research"], "entities": ["Atlarix", "opencode", "Terminal-Bench 2.0", "Minimax M3", "OpenRouter", "Harbor", "Modal"], "alternates": {"html": "https://wpnews.pro/news/atlarix-vs-opencode-on-terminal-bench-2-0-same-model-only-the-harness-changes-k", "markdown": 
"https://wpnews.pro/news/atlarix-vs-opencode-on-terminal-bench-2-0-same-model-only-the-harness-changes-k.md", "text": "https://wpnews.pro/news/atlarix-vs-opencode-on-terminal-bench-2-0-same-model-only-the-harness-changes-k.txt", "jsonld": "https://wpnews.pro/news/atlarix-vs-opencode-on-terminal-bench-2-0-same-model-only-the-harness-changes-k.jsonld"}}