Atlarix vs opencode on Terminal-Bench 2.0 — same model, only the harness changes (k=1, receipts included)


[ARTICLE · art-43927] src=dev.to ↗ pub=2026-06-29T19:22Z topic=ai-agents verified=true sentiment=· neutral

Atlarix, an agent workstation for open-weight models, resolved 42 out of 89 tasks on Terminal-Bench 2.0 compared to opencode's 39 out of 89, using the same model and identical infrastructure. The 3-task difference falls within k=1 noise, indicating that the harness does not bottleneck model performance.

I build Atlarix, an agent workstation for open-weight models. The core claim behind it is that the harness — retrieval, tool surface, control loop — is what lets an open-weight model perform, not just the model's raw weights. This post is me trying to falsify that claim with a controlled run, and publishing every output file so you can check it.

Short version: on Terminal-Bench 2.0, single attempt, Atlarix resolved 42/89 and opencode resolved 39/89 on the same model. That 3-task gap is within k=1 noise — I'm not claiming a win. What it shows is that the harness isn't bottlenecking the model. Details and caveats below; raw files at the end.

The only variable is the harness. Everything else is pinned identical across both agents.

Dataset: terminal-bench/terminal-bench-2, all 89 tasks, one isolated container each, automated verifiers.
Model: minimax/minimax-m3, routed through OpenRouter, pinned to a single provider.
Environment: Modal (-e modal), one container per task.
Attempts: -k 1.
Timeout: --timeout-multiplier 1 (same for both).
Retries: --max-retries 3 (same for both).

harbor run -d terminal-bench/terminal-bench-2 \
  -m openai/minimax/minimax-m3 \
  -n 24 -k 1 -y --timeout-multiplier 1 --max-retries 3 \
  -e modal --agent-import-path atlarix_tb:AtlarixAgent

harbor run -d terminal-bench/terminal-bench-2 \
  -m bench/minimax/minimax-m3 \
  -n 24 -k 1 -y --timeout-multiplier 1 --max-retries 3 \
  -e modal --agent-import-path atlarix_tb.opencode_proxy:BenchOpenCodeAgent

(-n 24 is concurrency, meaning how many containers run in parallel, not a task count. All 89 tasks run.)

Harness	Resolved	Score
Atlarix	42 / 89	47%
opencode	39 / 89	44%

k=1 means one sample per task. The official Terminal-Bench leaderboard requires k=5 specifically to measure run-to-run variance. A 3-task difference at k=1 is inside that noise band. So this is not a leaderboard result and not a claim that Atlarix beats opencode. The honest takeaway: an open-weight model performs about as well under Atlarix as under a strong existing harness — the harness isn't holding it back.
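One rough way to sanity-check the noise claim is a back-of-envelope binomial estimate. The sketch below treats each of the 89 tasks as an independent pass/fail trial, which is a simplifying assumption and not part of the benchmark methodology: both runs share the same tasks, and this measures sampling error, not the run-to-run variance that k=5 is meant to capture. The function names are illustrative only.

```python
import math


def pass_rate(resolved: int, total: int) -> float:
    """Fraction of tasks resolved."""
    return resolved / total


def std_error(p: float, n: int) -> float:
    """Binomial standard error of a pass rate p measured over n tasks."""
    return math.sqrt(p * (1.0 - p) / n)


def gap_in_std_errors(resolved_a: int, resolved_b: int, n: int) -> float:
    """Gap between two pass rates, in units of the gap's standard error.

    Assumption: the two runs are treated as independent samples, which is
    only an approximation because both harnesses ran the same tasks.
    """
    p_a = pass_rate(resolved_a, n)
    p_b = pass_rate(resolved_b, n)
    se_gap = math.sqrt(std_error(p_a, n) ** 2 + std_error(p_b, n) ** 2)
    return (p_a - p_b) / se_gap


if __name__ == "__main__":
    n = 89
    for name, resolved in (("Atlarix", 42), ("opencode", 39)):
        p = pass_rate(resolved, n)
        print(f"{name}: {p:.1%} +/- {std_error(p, n):.1%}")
    print(f"gap: {gap_in_std_errors(42, 39, n):.2f} standard errors")
```

Under these assumptions the 3-task gap comes out to less than half a standard error, which is consistent with reading it as noise.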

~25% of tasks timed out, for both harnesses. At the native timeout (×1), roughly a quarter of tasks hit AgentTimeoutError on each side and count as unresolved. So the sub-50% absolute scores aren't all capability failures; a meaningful share are wall-clock limits hit on heavy tasks. The timeout ceiling is identical for both agents, so the comparison stays fair, but it is also why neither score is higher.

Atlarix's desktop app asks for human approval before every file write and command, a core safety feature. Benchmarks run unattended, so I grant that approval once via an explicit operator flag (ATLARIX_AUTONOMOUS_DANGER=1). Without it, any task needing an install or privileged command is blocked and fails.

This is not an advantage over opencode — every agent auto-approves to run an automated benchmark; it's inherent to running unattended. Stating it for full transparency. The flag is off by default; the interactive app always asks.
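To make the approval gate concrete, here is a minimal sketch of how an opt-in flag of this kind can work. It is not Atlarix's actual code: only the environment variable name comes from this post, and the function names, prompt text, and y/N convention are hypothetical.

```python
import os

# Variable name taken from the post; everything else here is illustrative.
AUTONOMOUS_FLAG = "ATLARIX_AUTONOMOUS_DANGER"


def autonomous_mode(env=None) -> bool:
    """True only when the operator has explicitly set the flag to "1"."""
    env = os.environ if env is None else env
    return env.get(AUTONOMOUS_FLAG) == "1"


def approve(action: str, env=None, ask=input) -> bool:
    """Decide whether a file write or command may run.

    With the flag set, approval is granted without prompting (unattended
    benchmark runs). Otherwise a human is asked, and anything other than
    an explicit yes is treated as a refusal.
    """
    if autonomous_mode(env):
        return True
    answer = ask(f"Allow: {action}? [y/N] ")
    return answer.strip().lower() in ("y", "yes")
```

The default path is the interactive one: with the flag unset, nothing runs unless a person says yes.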

The exact Atlarix bundle I ran is a public, Electron-free headless build: atlarix-headless-linux-amd64.tar.gz. The benchmark is the open-source Harbor framework. The raw Harbor result files (per-task pass/fail for both harnesses) are published unedited. Nothing is hand-typed.

Everything (raw result.json for both sides, summary.csv, exact bundle, full setup): atlarix.dev/benchmark

If you spot something wrong in the result files, that's the point — tell me.
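If you want to check the files yourself, a per-task comparison is a reasonable starting point. The sketch below assumes you have already reduced each result file to a mapping from task name to a resolved flag; the layout of Harbor's result.json is not shown in this post, so the loading step is deliberately left out, and the task names in the example are placeholders.

```python
from typing import Dict, List, NamedTuple


class Comparison(NamedTuple):
    both: List[str]     # resolved by both harnesses
    only_a: List[str]   # resolved by harness A only
    only_b: List[str]   # resolved by harness B only
    neither: List[str]  # resolved by neither


def compare(a: Dict[str, bool], b: Dict[str, bool]) -> Comparison:
    """Split tasks by which harness resolved them.

    Both mappings must cover exactly the same tasks, otherwise the
    comparison would not be like for like.
    """
    if a.keys() != b.keys():
        mismatched = sorted(a.keys() ^ b.keys())
        raise ValueError(f"task sets differ: {mismatched}")
    tasks = sorted(a)
    return Comparison(
        both=[t for t in tasks if a[t] and b[t]],
        only_a=[t for t in tasks if a[t] and not b[t]],
        only_b=[t for t in tasks if b[t] and not a[t]],
        neither=[t for t in tasks if not a[t] and not b[t]],
    )


if __name__ == "__main__":
    # Placeholder task names, not real Terminal-Bench tasks.
    atlarix = {"task-a": True, "task-b": True, "task-c": False, "task-d": False}
    opencode = {"task-a": True, "task-b": False, "task-c": True, "task-d": False}
    result = compare(atlarix, opencode)
    print("only Atlarix:", result.only_a)
    print("only opencode:", result.only_b)
```

The headline gap equals the number of tasks only one harness resolved minus the number only the other resolved, so the two "only" lists show how much per-task disagreement sits behind a 3-task difference.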

Built in Nairobi.


Read original on dev.to → dev.to/amariahak/atlarix-vs-opencode-on-terminal…
