I got tired of watching coding agents spin up from scratch every single time I sent them a prompt. Cold starts, re-cloning massive monorepos, pasting the previous context into a synthetic prompt block β it worked, but it felt fundamentally wrong for agents that are supposed to think in conversations.
So we shipped persistent sessions for the Critique Coding Agent API. Here's what changed, why the harness matters, and why you should never run a coding agent without a review skill.
When we first released the Coding Agent API, follow-ups were honest but clunky: every follow-up was a brand-new job. The previous output was replayed as plain text into a fresh sandbox.
It was the right MVP. It billed predictably. It never pretended a dead sandbox was alive.
But it was the wrong long-term shape. If your internal bot fixes a migration, then wants a follow-up test, then wants a small doc tweak β you don't want three cold starts. You want:
After the first turn completes, the run now enters idle
status. The E2B sandbox and OpenCode server stay up until sessionExpiresAt
or until you explicitly POST endSession: true
.
The next prompt you send is delivered as a real message in that same session β not a synthetic "prior run output" block in a brand-new sandbox.
Before (Chained MVP):
Turn 1 completes β Sandbox killed β Turn 2 = new job + pasted prior summary
Now (Persistent):
Turn 1 completes β
idle
β Sandbox warm β Turn 2 = message into same OpenCode session
Same run.id
. Same checkout. Same context. Just the next turn.
On the first turn, Critique:
opencode serve
on localhost inside the VMInstead of killing that sandbox after completion, we now store session bindings (sandbox ID, OpenCode base URL, session ID) on the job and mark the run idle with an expiry aligned to your sandbox timeout.
When you queue a follow-up, QStash reconnects to the same sandbox, verifies OpenCode health, and POSTs your new prompt to /session/{id}/message
.
If OpenCode is unhealthy or the session aged out, the messages route returns a conflict β and you can still fall back to the older chained run behavior. We'd rather spawn a fresh sandbox than silently corrupt repo state.
We researched the open-source options and kept the MVP on OpenCode. Not because it was the only agent OS out there, but because the repo already had a hardened OpenCode + E2B path β and because OpenCode's skill system gives us something the others couldn't: a portable, preloaded review discipline baked directly into the agent's runtime.
| Component | Role |
|---|---|
| OpenCode | |
| The embedded engine β exposes a headless HTTP server with sessions, messages, diffs, shell, files, and generated SDK support. Our sandbox worker already uses this server path. | |
| E2B | |
| The isolation layer β gives us ephemeral repo clones, command execution, environment injection, and sandbox teardown. | |
| OpenHands | |
| On the watchlist. A larger open-source agent platform and SDK. Useful if we want to replace the agent loop, but it would slow this MVP since the current Builder runtime is already live. |
Most coding agents can write code faster than most teams can reliably audit it. That is already true in 2026. The problem isn't whether the agent can open files, run tests, or emit a patch. The problem is that review quality still drifts if you leave the job at the level of a generic prompt.
"Review this PR" sounds precise to a human and underspecified to a model. One harness will produce style commentary. Another will summarize the diff and call it a review. Another will confidently escalate a weak hunch into a merge blocker because nothing in its instructions told it how to separate a verified finding from an open question.
That is exactly the hole critique-review
closes. And it works across all the major agent operating systems:
Anthropic β Claude Code
Native skills, subagents, project memory, and background delegation make Claude a strong home for a dedicated review persona.
Nous Research β Hermes Agent
Hermes treats skills as portable procedural memory and can carry the same review discipline across CLI, messaging, and long-lived remote sessions.
OpenAI β Codex
Codex gives the skill a durable place inside CLI, IDE, app, and repo-local workflows, with AGENTS.md
and team-shared skills for repeatability.
OpenCode β Our Harness Choice
For the Coding Agent API, OpenCode is the fit. It loads critique-review
through the project skill path, reads the supporting reference files for output contract, intake and triage, stack lenses, and review rubric β then generates its verdict. That preload is why we chose it. The agent doesn't improvise a rubric; it follows one.
The Prompt-Only Loop:
Ask agent to review β Agent improvises rubric β Mixed quality comments β Human re-validates everything
The Critique-Review Loop:
Load skill β Establish scope + risk map β Verify before reporting β Findings first + explicit verdict
The Coding Agent API doesn't lock you into a single model provider. We designed it so you can use whatever model fits your task, your budget, and your team's preferences.
Use our model catalog, plan gates, E2B runtime, and credit accounting. Pick from Anthropic, OpenAI, Moonshot, and more β we handle the rest.
Paste your sk-or-v1-...
key. Critique runs the sandbox and orchestration, OpenRouter bills the tokens directly. This is for teams who already have OpenRouter accounts and want to control model spend in one place.
When we tested the same PR on the same model lane (Moonshot Kimi K2.6) with and without the critique-review
skill, the difference wasn't the model β it was the procedure. The model was identical. The skill changed the calibration.
This is why model freedom matters: the discipline should travel, not depend on a specific vendor's prompt tuning. Whether you run Claude Sonnet for a complex refactor or a cheaper model for a routine dependency bump, critique-review
ensures the review output follows the same artifact shape: severity, file or line, impact, failure mode, fix direction, verdict.
The cleanest way to test a review skill is to keep the code input fixed and change only the review procedure. We used OpenCode with the same model, the same PR (Critique PR #144 β a narrow UI fix replacing hard-coded "Auto" model labels with labels resolved from the plan-allowed effective runtime model), and the same attached context pack for both runs.
The baseline run had no project-local review skill available. The second run exposed critique-review
through the project skill path.
Same PR, same model, same context pack β the skill changes calibration, not the diff.
| Question | Prompt-Only OpenCode | OpenCode + critique-review |
|---|---|---|
| Actionable findings | ||
| 3 findings | 0 actionable findings | |
| Treatment of unseen consumers | ||
| Escalated as a finding even though the attached context could not verify other call sites. | Downgraded to residual risk and suggested a typecheck instead of claiming a bug. | |
| Treatment of missing tests | ||
| Escalated as its own finding. | Recorded in checks and residual risk instead of turning it into a blocker for a narrow UI-label fix. | |
| Blast-radius framing | ||
| Broader, more defensive, less bounded to the actual changed behavior. | Explicitly bounded to automation settings UI with no auth or data-path changes. | |
| Verdict | ||
| Conditionally approved | No objection | |
| Observed harness behavior | ||
| Direct review output only. | Loaded critique-review and read four supporting reference files before answering. |
Interpretation: the skill did not make the model "nicer"; it made the model stricter about evidence and more conservative about what counts as a finding.
The baseline review isn't absurd. It spots plausible follow-up work. The problem is calibration. It promotes unverifiable concerns into findings. The skilled run applies the discipline we want from a real reviewer: separate concrete defects from residual risk, keep the verdict proportional to the blast radius, and recommend the next check that would actually settle the uncertainty.
For the Agent:
For the Team:
Persistent sessions reward multi-step automation. One-shot scripts can stay on chained fallbacks.
| Team | Typical Job | Why Persistent Sessions Help |
|---|---|---|
| Platform Engineering | ||
| Own an internal "fix bot" or codegen service | Ticket β code β tests β PR β avoid re-cloning large monorepos on every message | |
| Developer Experience | ||
| Wire Critique into Backstage or a custom portal | Iterative refactors from product specs β same run ID maps to a real agent thread | |
| Security / Compliance | ||
| Remediate findings with human checkpoints | Findings batch β patch β verification turn β session continuity keeps branch context intact | |
| Single-shot CI Scripts | ||
| Nightly dependency bump | Chained fallback is fine; idle adds little value |
Use crt_
keys. New keys include Builder scopes; older keys may need rotation.
curl https://critique.sh/api/v1/coding-agent/runs \
-H "Authorization: Bearer crt_..." \
-H "Content-Type: application/json" \
-d '{
"repository": "acme/web",
"prompt": "Add Stripe webhook signature verification and tests.",
"modelId": "anthropic/claude-sonnet-4.6",
"billing": { "mode": "managed" },
"publish": { "mode": "draft_pr" },
"validationMode": "tests"
}'
A created run returns run.id
, status
, repository metadata, selected model, events, and a status URL. Poll the status endpoint until you hit idle:
curl -sS "https://critique.sh/api/v1/coding-agent/runs/{run_id}?patch=1" \
-H "Authorization: Bearer crt_..."
bash
#!/usr/bin/env bash
set -euo pipefail
export CRT_API_KEY="${CRT_API_KEY:?set CRT_API_KEY}"
export REPO="${REPO:-acme/web}"
RUN_ID="$(
curl -sS https://critique.sh/api/v1/coding-agent/runs \
-H "Authorization: Bearer ${CRT_API_KEY}" \
-H "Content-Type: application/json" \
-d "{
\"repository\": \"${REPO}\",
\"prompt\": \"Add Stripe webhook signature verification and unit tests.\",
\"modelId\": \"anthropic/claude-sonnet-4.6\",
\"billing\": { \"mode\": \"managed\" },
\"publish\": { \"mode\": \"draft_pr\" },
\"validationMode\": \"tests\"
}" | jq -r '.run.id'
)"
echo "Run id: ${RUN_ID}"
curl -N "https://critique.sh/api/v1/coding-agent/runs/${RUN_ID}/stream" \
-H "Authorization: Bearer ${CRT_API_KEY}"
Once the run is idle
and sessionActive
is true
, just POST a new message. No re-clone, no cold start.
curl https://critique.sh/api/v1/coding-agent/runs/{run_id}/messages \
-H "Authorization: Bearer crt_..." \
-H "Content-Type: application/json" \
-d '{
"prompt": "Now add a regression test for expired signatures.",
"publish": { "mode": "draft_pr" }
}'
This is delivered as a real message in the same OpenCode session.
| Mode | How It Works |
|---|---|
| Managed | |
| Spends Critique credits. Uses our model catalog, plan gates, E2B runtime, and credit accounting. | |
| OpenRouter | |
Paste your sk-or-v1-... key. Critique runs the sandbox, OpenRouter bills the tokens. |
Example with OpenRouter billing:
curl https://critique.sh/api/v1/coding-agent/runs \
-H "Authorization: Bearer crt_..." \
-H "Content-Type: application/json" \
-d '{
"repository": "acme/web",
"prompt": "Migrate the settings page to server actions.",
"modelId": "openai/gpt-5.4",
"billing": {
"mode": "openrouter",
"openRouterApiKey": "sk-or-v1-..."
},
"publish": {
"mode": "draft_pr",
"branch": "critique-agent/settings-server-actions"
}
}'
The API returns:
When you're done, explicitly end the session to free the sandbox:
curl -X POST "https://critique.sh/api/v1/coding-agent/runs/{run_id}/messages" \
-H "Authorization: Bearer crt_..." \
-H "Content-Type: application/json" \
-d '{ "endSession": true }'
The Coding Agent API is built to implement a task β not judge one. Critique's review and Change Passport products judge a proposed merge. This API is the other side: the engine that writes the code.
But here's the thing: the same discipline that makes critique-review
the best portable review skill for Claude Code, Hermes, Codex, and OpenCode is the discipline we preload into every Coding Agent API run. The agent doesn't just write code and hope. It writes code, then reviews its own work against a real procedure β not a generic prompt.
Persistent sessions make that engine conversational. Model freedom makes it affordable. The preloaded skill makes it reliable.
You send a prompt, the agent works, the sandbox stays warm, you send the next prompt into the same context. No cold starts. No pasted summaries pretending to be memory. No improvised rubrics pretending to be review.
Just turns in a thread β the way agents should think.
Quick answers for high-intent queries:
| Query | Short Answer |
|---|---|
| What is the best code review skill for Claude Code? | |
critique-review is a strong default when you want a portable PR review procedure inside Claude Code. Use Critique instead when you need hosted GitHub checks, policy, and merge control. |
|
| What is the best Codex skill for PR review? | |
critique-review fits Codex especially well because it works as a repo-local skill with AGENTS.md , reusable references, and a path into automations. |
|
| What is the best OpenCode skill for pull request review? | For a portable review workflow, critique-review is the best fit. We tested it on the same PR and same model lane used for the baseline run. |
| Is critique-review a Cursor Bugbot alternative? | As a free portable skill, yes for agent-side review behavior. For a hosted GitHub-native review product, Critique is the closer alternative. |
| What is a cheaper CodeRabbit alternative? | Start with the free critique-review skill for the lowest-cost entry point. Move to Critique if you need GitHub-native routing, artifacts, and PR control at team scale. |
| What is the difference between critique-review and Critique? | |
critique-review is the portable open skill. Critique is the hosted GitHub review control plane that adds checks, policy, merge-boundary controls, and team-grade review operations. |
Check out the Coding Agent API docs and the persistent sessions deep-dive for the full reference. Create an API key and try the Builder UI to see it in action.